NGS Exercise.4
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d | NGS Exercise.3e ]
# updated 2014 version
Compute read coverage from the mapping results
Contents
- 1 Introduction
- 2 about the notion of mapability and corresponding data
- 3 compute 'per base' genome coverage with samtools
- 4 compute 'per base' genome coverage with bedTools genomecov
- 5 compute the smoothed coverage in 10kb interval
- 6 count how much of chr21 non-gap regions was not sequenced with all reads
- 7 Note about coverage modes in RNA-Seq experiments
- 8 download exercise files
Introduction
After aligning reads to a reference genome, one should first perform a coverage analysis.
The coverage analysis results in two measurements
- The coverage depth is the count of reads that align to the reference genome at each given position; it includes reads that also align to other position (unless filtered accordingly).
- The coverage width reports the count of nucleotides in the reference genome that align to at least one NGS read; a number of reference positions are undefined (telomeres, centromeres) while other locations are highly degenerated (repeats or oligonucleotides); both types cannot be aligned to and will be absent from the coverage depth.
Knowing both the distribution of coverage depths and the genome positions not sequenced is important when doing NGS DNA variant analysis. It is obvious that some parts of the genomes (exons!) will not be sequenced and therefore not analyzed for the presence of variants. These regions may bear variations but will not be called and will seem to be invariable in the end result file (no-calls)
A basic coverage analysis allows detecting regions of higher copy number (either true or apparent) in the genome as well as regions with less or no coverage (possibly deleted in the sequenced genome).
Although a true CNV analysis requires paired genome data like tumor and normal samples from the same individual, a 'basic' coverage analysis is possible with a single sample using read mapping information but will be affected by the 'mappability' effects explained above and should therefore be 'normalized' as compared to other related genomes or using local averaging methods (loess normalization / sliding window averaging).
Taking all reads mapped to the reference genome and piling them up (build a multiple sequence alignment from the mapped reads) allows computing the genome coverage. This is an important metric that will for instance indicate regions of about 50% additional coverage as compared to the median value (potential duplicated loci for one allele), regions of about-half average coverage (potential haploid deletions) and finally regions of no-coverage (potential homozygous deletions). All these results should of course be taken with a lot of salt as artifacts and mapping issues can also lead to similar results (see illustration below for the first exon of the DSCAM gene).
Various tools allow computing read coverage, we will only detail below a selection that allows specific measurements and are simple to use.
about the notion of mapability and corresponding data
Not all regions of the human genome get equal chance to be mapped by reads. This is an important and often underestimated issue with genome sequencing that led to the definition of terms like mapability.
Read mapability or alignability is defined as the probability of any given region to be efficiently sequenced by NGS sequencing. Mapability is not constant across the reference genome and is subject to various effects associated with sequence content (GC, oligomers, N-regions) but also to the existence of larger repeated loci. Excessive piling up of reads is indicative of poorly mapped region
where do I get mapability data
A 100bps fragment is reported with a value of 1 if unique in the genome, of 0.5 for fragments mapping to 2 alternate locations, and of 1/n for n alternate mapping fragments.
- web-page:<http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability>[1]
- direct-link: <http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/>[2]
http://genome.ucsc.edu/ENCODE/terms.html. For a filterable list of these files along with additional information such as release status, restriction dates, track description, methods, and metadata, please use the 'Downloadable Files' user interface at: http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability In addition to data files, this directory may also contain some or all of the following: * files.txt - a file listing each data file and its metadata. * supplemental - a directory containing any additional files provided by the submitting laboratory. * obsolete files - WARNING - Revoked and replaced data files may be present in this directory. Please make every effort to determine that you are using the latest version of the data. To do so access the recommended 'Downloadable Files' user interface via the URL provided above or check files.txt. If the data file is not listed in files.txt or has a tag of objStatus=revoked or replaced, then the file is obsolete. * releaseN directories - WARNING - These are only present on the genome-preview site. Please note that data files on genome-preview have not been verified or released to the public. They are subject to change without notice and should be used with caution. Name Last modified Size Description Parent Directory - files.txt 24-Feb-2012 14:19 3.9K md5sum.txt 26-Oct-2011 11:22 748 wgEncodeCrgMapabilityAlign24mer.bigWig 27-Apr-2010 06:44 5.0G wgEncodeCrgMapabilityAlign36mer.bigWig 27-Apr-2010 07:47 1.3G wgEncodeCrgMapabilityAlign40mer.bigWig 27-Apr-2010 08:49 1.2G wgEncodeCrgMapabilityAlign50mer.bigWig 27-Apr-2010 09:44 706M wgEncodeCrgMapabilityAlign75mer.bigWig 27-Apr-2010 10:26 281M wgEncodeCrgMapabilityAlign100mer.bigWig 27-Apr-2010 04:25 99M wgEncodeDacMapabilityConsensusExcludable.bed.gz 04-May-2011 14:08 4.6K wgEncodeDukeMapabilityRegionsExcludable.bed.gz 28-Mar-2011 16:16 17K wgEncodeDukeMapabilityUniqueness20bp.bigWig 15-Mar-2011 17:51 1.4G wgEncodeDukeMapabilityUniqueness35bp.bigWig 11-May-2011 13:39 929M
Can I compute mappability for my own reference genome (if not already available from UCSC)?
- Yes you can! using the GEM-library software and the fasta sequence of your genome. We provide a tutorial to do so in a separate page Create_a_mappability_track on our Wiki.
compute 'per base' genome coverage with samtools
Samtools is the main toolbox dedicated to SAM/BAM formatted files and is written in C++ and Perl. Samtools is complemented by similar tools including the popular Picard tools (written in Java) or the Broad Institute GATK (not used here due to license limitations).
REM: Samtools assumes diploid genomes, this might become an issue if you work on aneuploid species (plants) for which alternative tools exist. For haploid genomes (eg. bacteria, viruses) the consequences are less important and samtools can be used with some post hoc filtering.
#! /usr/bin/env bash ## script: samtools_coverage.sh ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-09 # required: # samtools Version: 0.1.19-44428cd # Samtools in recent versions has replaces pileup with mpileup. # Keep this in mind when you read online documentation as functionalities # have significantly changed with mpileup. # full chromosome # infolder=hg19_bwa-mapping/full_chromosome # outfolder=samtools_genomeCvg/full_chromosome ref=ref/HiSeq_UCSC_hg19.fa genomeint=ref/hg19.100k.bed infolder=Final_Results/hg19_bwa-mapping outfolder=samtools_genomeCvg mkdir -p ${outfolder} # process bam data by mapping methods, compress, and save for b in ${infolder}/*_mdup_chr21.bam; do # using 'expr': another way to extract string subsets name=$(expr "$b" : '.*shuffled_\(.*\)_mdup_chr21.bam$') samtools depth ${b} | \ bgzip -c > ${outfolder}/depth_${name}.txt.gz done
The resulting data is not ready for visualisation and would require more reformatting.
samtools coverage first lines for bwa_aln on the 10pc sample
chr21 9467391 1 chr21 9467392 1 chr21 9467393 1 chr21 9467394 1 chr21 9467395 1 chr21 9467396 1 chr21 9467397 2 chr21 9467398 2 chr21 9467399 2 chr21 9467400 2 ...
compute 'per base' genome coverage with bedTools genomecov
Another tool to compute global coverage per base is bedtools genomecov, part of the very complete bedtools suite. More info about this command here[3]
#! /usr/bin/env bash # script: bedtools_genomecoverage.sh # required: # bedtools Version: v2.17.0 or more # IGVtools # Summary: Compute the coverage of a feature file across a genome. # # Usage: bedtools genomecov [OPTIONS] -ibam <bam> -g <genome> # we use here the genome size file generated above from the UCSC SQL query genome=ref/hg19.genome # sample or full data infolder=Final_Results/hg19_bwa-mapping outfolder="bedtools_genomeCvg" mkdir -p ${outfolder} for b in ${infolder}/*_mdup_chr21.bam; do # using 'expr': another way to extract string subsets name=$(expr "$b" : '.*shuffled_\(.*\)_mdup_chr21.bam$') echo echo "# processing ${name}" # computing a coverage summary used to plot the coverage distribution with R # we only keep chr21 lines obviously bedtools genomecov -ibam ${b} \ -g ${genome} | grep "^chr21" > ${outfolder}/${name}-histo.txt # compute full genome coverage for IGV track # -bga: Report depth in BedGraph format like with -bg. # However with this option, regions with zero coverage are also reported. # This allows one to quickly extract all regions of a genome with 0 coverage # by applying: “grep -w 0$” to the output (see below). # the output is fairly large and can be reduced with '''igvtools toTDF''' bedtools genomecov -bga \ -ibam ${b} \ -g ${genome} > ${outfolder}/${name}.bg # convert .bg to .igv and compress with igvtools on the fly (gawk 'BEGIN{FS="\t"; OFS="\t"}{print $1, $2, $3, "genomecov", $4}' \ ${outfolder}/${name}.bg > ${outfolder}/${name}-cvg.igv) && \ (igvtools toTDF -z 4 -f min,max,mean ${outfolder}/${name}-cvg.igv \ ${outfolder}/${name}-cvg.tdf hg19) # extract 0-coverage regions from the bg file to a second file with '1' as score grep -w 0$ ${outfolder}/${name}.bg | \ ( gawk 'BEGIN{FS="\t"; OFS="\t"}{print $1, $2, $3, "no-coverage", 1}' \ > ${outfolder}/nocvg-${name}-cvg.igv ) && \ ( igvtools toTDF -z 4 -f min,max,mean ${outfolder}/nocvg-${name}-cvg.igv \ ${outfolder}/nocvg-${name}-cvg.tdf hg19 ) done
files resulting from the above script for the full data
-rw-rw-r-- 1 bits bits 56K May 6 10:46 nocvg-PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.igv -rw-rw-r-- 1 bits bits 66K May 6 10:46 nocvg-PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.tdf -rw-rw-r-- 1 bits bits 63K May 6 10:51 nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.igv -rw-rw-r-- 1 bits bits 72K May 6 10:51 nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.tdf -rw-rw-r-- 1 bits bits 424M May 6 10:44 PE_NA18507_GAIIx_100_chr21_aln-pe.bg -rw-rw-r-- 1 bits bits 581M May 6 10:45 PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.igv -rw-rw-r-- 1 bits bits 54M May 6 10:46 PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.tdf -rw-rw-r-- 1 bits bits 30K May 6 10:42 PE_NA18507_GAIIx_100_chr21_aln-pe-histo.txt -rw-rw-r-- 1 bits bits 425M May 6 10:49 PE_NA18507_GAIIx_100_chr21_mem-pe.bg -rw-rw-r-- 1 bits bits 582M May 6 10:49 PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.igv -rw-rw-r-- 1 bits bits 54M May 6 10:50 PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.tdf -rw-rw-r-- 1 bits bits 46K May 6 10:47 PE_NA18507_GAIIx_100_chr21_mem-pe-histo.txt
- the first part of the first script will output a text histogram table for each chromosome with the count of positions having each possible coverage value (eg PE_NA18507_GAIIx_100_chr21_mem-histo.txt). Such histogram can be used to plot the coverage curve often seen in NGS reports (see next section using excel and [R])
- the second output PE_NA18507_GAIIx_100_chr21_mem-cvg.bg is a bedgraph formatted file that we can convert to IGV text format and then TDF binary format and visualize in IGV.
- the third part will extract regions of no coverage from the .bg data to identify mapping issues or loci absent from the sequenced genome (homozygous deletions).
The TDF file is a binary form of the IGV format that is lighter for file transfer and faster to load in IGV. By contrast it is not readable anymore and not applicable to downstream text parsing (give and take)
Coverage can be visualized in IGV directly from the BAM data (IGV will compute a coverage track from it), or by loading the TDF file obtained above in IGV side by side with other data. Advantage of the TDF data is that it is visible in full chromosome view while the BAM data only becomes visible at high zoom level.
view bedtools TDF coverage data in IGV
no-coverage regions in the 5'-end of the TPTE gene (chr21:10,988,000-10,993,000)
gap regions close to the chr21 centromer (chr21:10,310,800-11,582,800)
- Clone - gaps between clones in the same map contig. These may be bridged or not.
- Contig - non-bridged gaps between map contigs.
- Centromere - non-bridged gaps from centromeres.
- Telomere - non-bridged gaps from telomeres.
- Heterochromatin - non-bridged gaps from large blocks of heterochromatin.
- Short Arm - non-bridged long gaps on the short arm of the chromosome.
Legend: The figure shows both no-coverage and gaps (regions with only N's) from the hg19_gap.bed track. Different gap types are shown and labelled accordingly.
The IGV inspection of genome coverage data often allows the identification of sequencing artefacts ('coverage gaps or extreme coverage levels'). The gaps can be real when a portion of the chromosome is absent from the DNA sample. Other gaps reveal sequencing biases due to GC content (as shown below) or other mappability issues (see NGS_Exercise.8 for more examples). Regions of apparent extreme coverage can reveal genome duplications or CNV (many copies of a unique reference sequence in the DNA sample lead to many more reads than what is expected for a diploid genome)
When opening the aln TDF file, a very high coverage (>400x) is obtained in the BAGE region (chr21:10,669,922-11,221,719). This phenomenon is reported in Genecard (<http://www.genecards.org/cgi-bin/carddisp.pl?gene=BAGE4>).Miscellaneous: The ancestral BAGE gene was generated by juxtacentromeric reshuffling of the MLL3 gene. The BAGE family was expanded by juxtacentromeric movement and/or acrocentric exchanges. BAGE family is composed of expressed genes that map to the juxtacentromeric regions of chromosomes 13 and 21 and of unexpressed gene fragments that scattered in the juxtacentromeric regions of several chromosomes, including chromosomes 9, 13, 18 and 21
BAGE4 locus with abnormally high coverage
The high copy-number of that region in the NA18507 genome led to a large excess of reads mapping to the single represented locus (called reference genome compression). The artefact is confirmed by the presence of a very large number of variants across the lower BAM track that reflect the multiple sequence differences between BAGE gene copies (note that these variants will likely be ignored during calling by setting a max value for coverage).
MS-Excel can be used to print the coverage depth distribution for chr21
The content of the second and third columns of the PE_NA18507_GAIIx_100_chr21_mem-histo.txt file are used as data to build a scatter plot using MS-Excel.
excel coverage plot drawn from PE_NA18507_GAIIx_100_chr21_mem-histo.txt
a nicer [R] script to plot the coverage depth distribution for chr21
Copy the code below into a blank RStudio script page or load 'plot_coverage.R' from the 'scripts' folder
'plot_coverage.R' script
# script: plot_coverage.R # required to add the table to the plot # install once with 'install.packages("plotrix")' library(plotrix) setwd("~/NGS_DNASeq-training/bedtools_genomeCvg") for (met in c("aln", "mem")) { data <- paste("NA18507_GAIIx_100_chr21_", met, "-histo.txt", sep="") chr21 <- read.delim(data, header=F) colnames(chr21) <- c("chr", "depth", "count", "chrlen", "cov%") # get depth pic (avoid coverage depth=0) # pic <- chr21$depth[which.max(chr21$count[chr21$depth>0])]+1 # find X-coordinates for 5%, Q1, Q3, 95% sub <- chr21[chr21$depth>0,] vec <- rep(sub$depth, sub$count) Q005 <- quantile(vec, 0.05) Q1 <- summary(vec)[2] Q2 <- summary(vec)[3] Q3 <- summary(vec)[5] Q95 <- quantile(vec, 0.95) # create table for plot leg.table <- data.frame(depth=c(Q005, Q1, Q2, Q3, Q95)) # create pictures with coverage distribution filename <- paste("sample_", met, "_coverage-plot.png", sep="") #filename <- paste(met, "_coverage-plot.png", sep="") title <- paste("coverage depth (", met, ")", sep="") png(filename = filename, width = 250, height = 250, units = "px", pointsize = 12, bg = "white") # limits for all and sample datasets #xmax=80 xmax=10 #ymax=1500000 ymax=10000000 plot(chr21$count ~ chr21$depth, xlim=c(0, xmax), ylim=c(0, ymax), xlab=title, ylab="bp-count", type="l", col="blue", lwd=3, cex=0.5) # add vertical line at pic # abline(v=pic-0.5, col="grey", lwd=2) abline(v=Q005, col="grey", lwd=2) abline(v=Q1, col="grey", lwd=2) abline(v=Q2, col="red", lwd=2) abline(v=Q3, col="grey", lwd=2) abline(v=Q95, col="grey", lwd=2) # show most of the options addtable2plot(Q95, (max(chr21$count)/20), leg.table, bg="white", bty="o", display.rownames=TRUE, hlines=TRUE, vlines=TRUE, cex=0.75) dev.off() }
coverage depth distribution plots for all mappings in aln and mem BAM files
coverage depth from all BWA-aln/mem reads | 10pc results |
---|---|
NA18507 was sequenced with a median coverage depth of >40x which is a very decent value. The large number of bases with '0' coverage are located in the short arm of chr21, centromere, and telomeres where repeat frequency is highest and N-bases are very frequents (see IGV figures above).
compute the smoothed coverage in 10kb interval
create 10kb & 100kb *bins* for the 'hg19' reference genome
To normalize for local effects and obtain smoothed results, a stepping window of 10/100kb (arbitrary choice) is built using bedtools makewindows and used later with bedtools commands
#! /usr/bin/env bash # script: bedtools_makewindows.sh # required: # bedtools Version: v2.17.0 or more # Divide the human genome into stepping windows of 10kb, with no overlap # Usage: bedtools makewindows [OPTIONS] [-g <genome> OR -b <bed>] # [ -w <window_size> OR -n <number of windows>, -s <step_size> ] # 10kb stepping bedtools makewindows -g ref/hg19.genome \ -w 10000 > ref/hg19.10k.bed echo "# preview 10k windows (first 3 lines)" head -3 ref/hg19.10k.bed # chr1 0 10000 # chr1 10000 20000 # chr1 20000 30000 # 100kb stepping bedtools makewindows -g ref/hg19.genome \ -w 100000 > ref/hg19.100k.bed echo "# preview 100k windows (first 3 lines)" head -3 ref/hg19.100k.bed # chr1 0 100000 # chr1 100000 200000 # chr1 200000 300000
compute the average coverage with bedTools coverage in 10kb interval
If you wish instead to report a smoother local average value of read coverage, you can use the bedtools makewindows data generated earlier and summarize BAM coverage over the defined intervals with bedtools coverage. The output results are IGV | TDF formatted files that we can visualize in IGV to identify problematic genomic regions.
#! /usr/bin/env bash # scripts: bedtools_10k-coverage.sh # Required: bedtools v2.17.0 or higher # IGVTools # compute %coverage in 10k bins # Tool: bedtools coverage (aka coverageBed) # Summary: Returns the depth and breadth of coverage of features from A # on the intervals in B. # # Usage: bedtools coverage [OPTIONS] -abam <bam> -b <hg19.10k.bed> # Default Output: # After each entry in B, reports: # 4) The number of features in A that overlapped the B interval. # 5) The number of bases in B that had non-zero coverage. # 6) The length of the entry in B. # 7) The fraction of bases in B that had non-zero coverage. # we use here the genome size file generated above from the UCSC SQL query genome=ref/hg19.genome genomeint=ref/hg19.10k.bed infolder=Final_Results/hg19_bwa-mapping outfolder="bedtools_genomeCvg.10k" mkdir -p ${outfolder} # process chr21 BAM files for b in ${infolder}/*_mdup_chr21.bam; do # using 'expr': another way to extract string subsets name=$(expr "$b" : '.*shuffled_\(.*\)_mdup_chr21.bam$') echo echo "#processing ${name}" outfile=${name}_10k-cvg bedtools coverage -abam ${b} \ -b ${genomeint} | sort -k 1V,1 -k 2n,2 -k 3n,3 \ > ${outfolder}/${outfile}.txt # transform to igv for visualization # convert .txt to .igv and compress with igvtools on the fly gawk 'BEGIN{FS="\t"; OFS="\t"}{print $1, $2, $3, "percent.cvg", $7}' \ ${outfolder}/${outfile}.txt > ${outfolder}/${outfile}.igv igvtools toTDF -z 4 -f min,max,mean ${outfolder}/${outfile}.igv \ ${outfolder}/${outfile}.tdf hg19 done
files resulting from the above script and full data
-rw-rw-r-- 1 bits bits 14M May 6 11:46 PE_NA18507_GAIIx_100_chr21_aln-pe_10k-cvg.igv -rw-rw-r-- 1 bits bits 3.2M May 6 11:46 PE_NA18507_GAIIx_100_chr21_aln-pe_10k-cvg.tdf -rw-rw-r-- 1 bits bits 13M May 6 11:46 PE_NA18507_GAIIx_100_chr21_aln-pe_10k-cvg.txt -rw-rw-r-- 1 bits bits 14M May 6 11:48 PE_NA18507_GAIIx_100_chr21_mem-pe_10k-cvg.igv -rw-rw-r-- 1 bits bits 3.2M May 6 11:49 PE_NA18507_GAIIx_100_chr21_mem-pe_10k-cvg.tdf -rw-rw-r-- 1 bits bits 13M May 6 11:48 PE_NA18507_GAIIx_100_chr21_mem-pe_10k-cvg.txt
partial coverage in the ANKRD30BP2 gene region (chr21:14,445,900-14,480,100)
bedtools allows many more computations and reads from BED, VCF, BAM and more formats given a little bit of gawk magic.
create a Circos plot with the 10kb smoothed results
This section is provided as is and will not be executed during the training.
Feel free to recycle the code and make it yours.
Circos plots are quite popular due to their attractive aesthetics but are not ideal for detailed inspection. We provide this starter code (very basic plot indeed!) to introduce you to the world of Circos. Up to you to explore this further if interested. You will find all necessary information on about to install, configure, and use Circos in Martin Krzywinski pages <http://circos.ca>[4]. Some likely missing files required to run the code below were copied to the exercise page on our server in the 'missing_etc-files.zip' archive . To reuse the script on your own machine, please extract the contained files to your own 'path-to-circos/etc' folder.
Perl script used to create the Circos data, configuration file and finally plot
#!/usr/bin/perl use strict; use POSIX qw/strftime ceil/; use File::Basename; # required: install with 'sudo cpan File::Basename' use Statistics::Descriptive; # script: plot_with_circos.sh # create Circos plot from bedtools_genomeCvg.igv data # parse a bedtools 10k-cvg.igv file # chr1 0 10000 percent.cvg 0.0000000 # chr1 10000 20000 percent.cvg 0.0146000 # chr1 20000 30000 percent.cvg 0.0000000 # chr1 30000 40000 percent.cvg 0.0000000 # chr1 40000 50000 percent.cvg 0.0000000 # chr1 50000 60000 percent.cvg 0.0000000 # # Stéphane Plaisance VIB-BITS Nov-04-2013 v1.0 @ARGV == 2 || die ("usage: plot_with_circos.sh <bedtools_genomeCvg.igv> <target-chr>\n"); my ($infile, $onechr) = @ARGV; # convert in circos naming my $circhr = $onechr; $circhr =~ s/chr/hs/; print stderr "chromosome:",$circhr,"\n"; my $name = basename($infile, ".igv"); my $conffile=$name."_coverage-circos.conf"; my $plotfile=$name."_coverage-plot.png"; my $datafile=$name."_coverage-data.txt"; my $stat = Statistics::Descriptive::Full->new(); # one pass through data to compute min mid and max my $z = OpenArchiveFile ($infile) || die "Error reading from $infile: $!\n"; my $cnt = 0; print STDERR "# parsing data (\'.\'=10000 rows): "; while (my $line = <$z>) { my ($chr, undef, undef, undef, $val) = split(/\t/, $line); # autosomes only next if $chr != $onechr; # exclude no-cvg next if $val==0; $cnt++; $cnt =~ /00000$/ && print STDERR "."; $stat->add_data(100*$val); } undef $z; my $count = $stat->count(); my $min = $stat->quantile(1); my $med = $stat->quantile(3); my $max = 1.1*$stat->quantile(4); print STDERR "# analyzed:$count\n# min\tmed\tmax\n# $min\t$med\t$max\n"; # call sub to create .conf file print STDERR ". creating conf and data files\n"; createConf(); # extract column from quantile file and create Circos data file my $z = OpenArchiveFile ($infile) || die "Error reading from $infile: $!\n"; open(DATA, "> $datafile") || die "Error: cannot create output :$!\n"; while (my $line = <$z>) { my ($chr, $start, $end, undef, $val) = split(/\t/, $line); # autosomes only next if $chr != $onechr; # rename 'chr' to 'hs' for circos $chr =~ s/chr/hs/; print DATA join ("\t", $chr, $start, $end, 100*$val)."\n"; } close data; undef $z; print STDERR ". creating Circos plot\n"; system ("circos -conf $conffile"); #=============== #==== SUBS ===== #=============== sub OpenArchiveFile { my $infile = shift; my $FH; if ($infile =~ /.bz2$/) { open ($FH, "bzcat $infile |") or die ("$!: can't open file $infile"); } elsif ($infile =~ /.gz$/) { open ($FH, "zcat $infile |") or die ("$!: can't open file $infile"); } elsif ($infile =~ /.tsv|.txt|.bedg|.igv$/) { open ($FH, "cat $infile |") or die ("$!: can't open file $infile"); } else { die ("$!: do not recognise file type $infile"); # if this happens add, the file type with correct opening proc } return $FH; } # generate conf file paired to the data sub createConf { open(CONF, "> $conffile") || die "Error: cannot create output file :$!\n"; my $circos_root = $ENV{BIOTOOLS}."/circos"; my $circos_etc = $ENV{BIOTOOLS}."/circos/etc"; my $circos_data = $ENV{BIOTOOLS}."/circos/data"; print CONF <<"EOT"; <<include $circos_etc/colors_fonts_patterns.conf>> <<include $circos_etc/ideogram.conf>> <<include $circos_etc/ticks.conf>> <image> file* = $plotfile <<include $circos_etc/image.conf>> </image> karyotype = $circos_data/karyotype/karyotype.human.hg19.txt #chromosomes_units = 1000000 #chromosomes_display_default = yes chromosomes_units = 1000000 chromosomes_display_default = no chromosomes = $circhr chromosomes_breaks = -$circhr:120-140 <plots> # tall histogram immediately inside the ideogram circle # background* parameters define a colored backdrop for this histogram # axes* define y-axes <plot> type = histogram file = $datafile r1 = 0.9r r0 = 0.7r max = $max min = 0 stroke_type = outline thickness = 4 color = vdgrey extend_bin = no <backgrounds> <background> color = vvlgrey </background> </backgrounds> <axes> <axis> spacing = 0.1r color = lgrey thickness = 2 </axis> </axes> <rules> <rule> use = no condition = var(value) < 0 show = no </rule> <rule> #condition = var(value) < 0 condition = 1 fill_color = eval(sprintf("spectral-9-div-%d",remap_int(var(value),-1,1,1,9))) </rule> </rules> </plot> </plots> <<include etc/housekeeping.conf>> EOT close CONF; }
chr21 coverage plot for the BWA mem mappings
CircosAPI is a separate package helping build circos configuration files and can be obtained at <http://kylase.github.io/CircosAPI/>[5] together with a nice tutorial [6]
For a good tutorial to setup circos & circos_api as a virtual machine on your non-unix computer, see here[7]
count how much of chr21 non-gap regions was not sequenced with all reads
This little exercise shows classic use of grep, wc, and awk; your best friends in many situations.
## script: count_no-coverage.sh ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-09 # compute for full bwa_mem chr21 data infolder=bedtools_genomeCvg/full_chromosome outfolder=bedtools_genomeCvg/full_chromosome mkdir -p ${outfolder} infile=nocvg-PE_NA18507_GAIIx_100_chr21_mem.igv outfile=chr21_${infile%%.igv}.bed # create subset for chr21 grep "^chr21" ${infolder}/${infile} > ${outfolder}/${outfile} len=$( wc -l ${outfolder}/${outfile} | cut -d " " -f 1 ); wid=$( awk 'BEGIN{FS="\t";OFS="\t";sum=0}{sum+=$3-$2}END{print sum}' ${outfolder}/${outfile} ); echo -e "count\twidth\n${len}\t${wid}" | column -t
count width 1693 14094487
The answer suggests that 1693 regions of chr21 are not covered by reads for a total width of more than 14Mbases. This seems a lot and of course most of this is due to telomeres and centromeres (hashed in grey in the pictogram below).
[8]How can we compute how much is not due to centromers?
Bedtools subtract allows excluding gaps from the result to obtain the sample specific no-coverage (see more about this command at: <http://bedtools.readthedocs.org/en/latest/content/tools/subtract.html>[9]).
the bedtols subtract command
Tool: bedtools subtract (aka subtractBed) Version: v2.17.0-111-gab7bbbe Summary: Removes the portion(s) of an interval that is overlapped by another feature(s). Usage: bedtools subtract [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> Options: -f Minimum overlap required as a fraction of A. - Default is 1E-9 (i.e., 1bp). - (FLOAT) (e.g. 0.50) -s Require same strandedness. That is, only subtract hits in B that overlap A on the _same_ strand. - By default, overlaps are subtracted without respect to strand. -S Force strandedness. That is, only subtract hits in B that overlap A on the _opposite_ strand. - By default, overlaps are subtracted without respect to strand. -A Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or -f amount), the entire feature is removed. -N Same as -A except when used with -f, the amount is the sum of all features (not any single feature).
The gap file was obtained from the UCSC table browser as described below and is stored under ref/hg19_gaps.bed.
how do I get gaps from the UCSC table browser?
## script: correct_no-coverage.sh ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-09 # compute for full bwa_mem chr21 data infolder=bedtools_genomeCvg/full_chromosome outfolder=bedtools_genomeCvg/full_chromosome outfile=chr21_${infile%%.igv}.bed mkdir -p ${outfolder} infile=chr21_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed gaps=ref/hg19_gaps.bed outfile=chr21_nogap_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed bedtools subtract -a ${infolder}/${infile} -b ${gaps} > ${outfolder}/${outfile} len=$( wc -l ${infolder}/${outfile} | cut -d " " -f 1 ); wid=$( awk 'BEGIN{FS="\t";OFS="\t";sum=0}{sum+=$3-$2}END{print sum}' ${infolder}/${outfile} ); echo -e "count\twidth\n${len}\t${wid}" | column -t
count width 1693 1073302
The large number of small no-coverage regions found is partly due to the fact that we work here with only 10% of the reads and have a limited sequencing depth. A possible next step would be to retrieve these sequences in order to analyze their base content or try to amplify them by conventional PCR for manual re-sequencing. Some of this is detailed in a supplementary exercise: Advanced usage of BEDTOOLS to retrieve unsequenced regions of chr21 for Sanger re-sequencing that you can read but do not need to do now.
What is the length distribution of the no-coverage regions?
We can also compute such basic descriptive statistics using R (and make plots) or more simply by applying awk on the BED files obtained above and the nice little qstats to compute simple stats. [11]. Note that a similar tool called stats is part of the filo package developed by Aaron Qinlan [12]
## script: gap_sizes.sh ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-09 # BED data obtained from the former scripts infolder=bedtools_genomeCvg/full_chromosome outfolder=bedtools_genomeCvg/full_chromosome mkdir -p ${outfolder} all=chr21_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed nongap=chr21_nogap_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed # we compute the no-coverage length with end-start awk 'BEGIN{FS="\t";OFS="\t";sum=0}{print $3-$2}' ${infolder}/${all} | qstats Min. 1 1st Qu. 6 Median 44 Mean 8325.15 3rd Qu. 118 Max. 9.43619e+06 Range 9.43619e+06 Std Dev. 241707 Length 1693 awk 'BEGIN{FS="\t";OFS="\t";sum=0}{print $3-$2}' ${infolder}/${nongap} | qstats Min. 1 1st Qu. 6 Median 44 Mean 633.965 3rd Qu. 118 Max. 61261 Range 61260 Std Dev. 3280.6 Length 1693
The results confirm that we mainly removed very large regions (centromeres) and that the remaining no-coverage blocks are relatively short with
* a median width of 44bps
* 75% of the fragments shorter or equal to 118bps.
Note about coverage modes in RNA-Seq experiments
In the case of spliced reads (RNA-Seq), coverage can be computed in various ways illustrated in the graph below. While this is not the aim of this training session, it is a worth mentioning detail. Please follow links below to get more info about HTSeq tools.
HTSeq[13] is originally a python library but it can also be used directly in terminal to compute coverage from the command htseq-count.
HTSeq illustration for coverage count modes
download exercise files
Download exercise files here
References:
- ↑ http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability
- ↑ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/
- ↑ http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html
- ↑ http://circos.ca
- ↑ http://kylase.github.io/CircosAPI/
- ↑ http://kylase.github.io/CircosAPI/tutorials/
- ↑ http://kylase.github.io/CircosAPI/circos-on-linux-virtual-machine/
- ↑ http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?TAXID=9606&CHR=21&MAPS=ideogr,genec[0.00%3A3200.00]&QSTR=SOD1%20OR%20APS1&QUERY=uid(313,6305,61194,9026314)&rsize=3200&GOTO=3053.71human%253A21%253AISCN
- ↑ http://bedtools.readthedocs.org/en/latest/content/tools/subtract.html
- ↑ http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=352776079&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_gap&hgta_ctDesc=table+browser+query+on+gap&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbDownBases=200&hgta_doMainPage=cancel
- ↑ https://github.com/tonyfischetti/qstats
- ↑ https://github.com/arq5x/filo
- ↑ http://www-huber.embl.de/users/anders/HTSeq
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d | NGS Exercise.3e ]