NGS RNASeq DE Exercise.4
Count reads at gene level using HTSeq
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.3 | NGS_RNASeq_DE_Exercise.5 ]
Contents
Introduction
In order to compute differential expression between sample groups, we first need to convert mapping results to gene level counts for each sample and merge all resulting count tables into a global table to be fed to R-bioconductor packages for statistical analysis. The counting can also be done in R using various packages but will be slower due to the lesser support for multi-threading and parallelization as compared to command-line.
Compute gene counts from BAM data
The following script perform gene count from each BAM file and saves the results to a simple two-column text file for each sample. We use the popular HTSeq-count executable [1] to compute gene counts. Note that other packages in R (Rsamtools, countOverlaps from IRanges, summarizeOverlaps from GenomicRanges) as well as standalone software like Qualimap propose methods for this aim.
HTSeq counts transcriptome hits from BAM data in different 'modes' proposed to the user and detailed below.
More information about this universally used package can be found on the developer web site
bam_htseq-count.sh script
#!/bin/bash # bam_htseq-count.sh # run HTSeq on single BAM sample # count only uniquely mapped mappings & alignQ > 10 # requires HTSeq (0.6.0+) ## requires sam data sorted by 'queryname' as in our markdup set # help: batch apply this script to all bams with ## for b in tophat2.0.13_results/markdup_bams/*_all-tophat-q.bam; do \ ## echo "# analyzing ${b}" \ ## scripts/bam_htseq-count.sh $b\ ## done # or even better run 5 jobs in 'parallel' if you have power # find tophat2.0.13_results/markdup_bams/ -name "*_all_mdup.bam" | \ # parallel --no-notice --workdir . --tmpdir ./tmp -j 5 \ # scripts/bam_htseq-count.sh {} # takes a single bam file from $1 if [ $# -lt 1 ] then echo "Usage: ${0##*/} <bam file> eg: tophat2.0.13-SRR1039509_results/markdup_bams/SRR1039509_all_mdup.bam" exit fi # argument#1 gives the bam file bam=$1 name=$(basename ${bam} ".bam") # test if bam is queryname-sorted order="$(samtools view -H ${bam} | head -1 | awk '{split($3, ord, ":"); print ord[2]}')" echo "# order is $order" echo if [ "$order" = "coordinate" ]; then echo "# sorting in queryname order" # redefine qbam qbam=${bam//.bam/-q.bam} cmd="java -jar $PICARD/picard.jar SortSam \ I=${bam} \ O=${bam//.bam/-q.bam} \ SO=queryname" eval ${cmd} && bam=${bam//.bam/-q.bam} fi # create output folder outfolder="htseq_counts" mkdir -p ${outfolder} # select transcript annotation model gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf # other options would be #gtffile=$BOWTIE2_INDEXES/chr22-hg19_ensGene.gtf #gtffile=$BOWTIE2_INDEXES/hg19_refGene.gtf # decompress bam with 'samtools' and use default parameters # possible IDs are: ## transcript_name (gene symbol) ## transcript_id (ENST-ID) ## exon_id (ENSE-ID) ## gene_id (ENSG-ID:default) ++ cmd="samtools view ${bam} | \ htseq-count \ -m intersection-strict \ -s no \ -a 10 \ -t exon \ -i gene_id \ - \ ${gtffile} \ > ${outfolder}/${name}_counts.txt \ 2> >(tee -a ${outfolder}/${name}_errorlog.txt >&2)" echo "# ${cmd}" eval ${cmd} echo # compute simple stats on ENSG transcripts with coverage # use non-null data from col#2 when col#1 contains a valid "ENSG ID" echo "# simple coverage stats for transcripts with read counts" echo "# ${bam}" awk 'BEGIN{FS="\t"; OFS="\t"}{if($1~/ENSG/ && $2>0) print $2}' \ ${outfolder}/${name}_counts.txt | qstats -s
Install R and bioconductor
This section does not need to be done for BITS laptops but is required if you run the code on your own machine; We do not provide all details on steps required to install the require base software and dependencies as they differ depending on your operating system. Please read the documentation behind each link/
- install R on your computer, if you rinstalled R version is outdated (eg installed using apt-get on ubuntu), consider removing it and reinstalling the latest stable version from CRAN.
- install the latest RStudio version from the RStudio site
- launch RStudio and run the following code to configure the basic bioconductor installation and update outdated packages in your installed version.
Install and Update RStudio and R packages
## Change default Bioconductor and CRAN mirrors ## this part will be run interactively (not run here) #chooseBioCmirror() #chooseCRANmirror() ## identify where R looks for packages # locate library folder .libPaths() ## make sure that you have writing rights in this place ## as this will allow you installing packages for all users of your machine # see what is currently installed library() ###### Install bioconductor (takes some time and need storage and internet speed) ##### # If you don't have the BiocInstaller package installed, you can quickly install and load it as follows: source("http://bioconductor.org/biocLite.R") # then update install the bioconductor base packages # as well as update all existing packages with the very easy command biocLite() ## upgrade bioconductor might be relevant if you have a very old version running from ages biocLite("BiocUpgrade") # next time you want to update packages, just come back and run source("http://bioconductor.org/biocLite.R") biocLite() # when you need a new package and know its name, most often this will do biocLite("<the name of the new package>") ## related to this training, you can access packages and their documentation within RStudio source("http://bioconductor.org/biocLite.R") # a full workflow related to the extended DESeq2 demo source("http://bioconductor.org/workflows.R") workflowInstall("rnaseqGene") library("rnaseqGene") vignette("rnaseqGene") # THE package library("DESeq2") vignette("DESeq2") browseVignettes("DESeq2") # data used by DESeq2 library("airway") browseVignettes("airway") # EdgeR library("edgeR") vignette("edgeR") ###### happy R'ing
RStudio can generate PDF, HTML, and Word documents from your code that also include figures and more
This is a really cool feature to write reports and share them. You will need for his a couple more dependencies at system level and will find their name when trying and failing to compile your first document (Dr G will help you fixing these annoying first steps)
Compare raw HTSeq counts from tophat 'all' and gtf' mappings
The Raw HTSeq counts obtained for the sample (SRR1039509) were used to evaluate the effect of mapping reads without or with guidance from a transcriptome file. HTSeq conversion of both BAM files returned counts distributions summarized with qstats as follows:
A typical HTSeq count file looks like this
ENSG00000000005 0
ENSG00000000419 488
ENSG00000000457 226
ENSG00000000460 52
ENSG00000000938 0
ENSG00000000971 3159
ENSG00000001036 1524
ENSG00000001084 386
ENSG00000001167 257
...
Summary lines reported for each HTSeq run
SRR1039509 | all-counts | gtf-counts |
---|---|---|
no_feature | 1871251 | 1212702 |
ambiguous | 567580 | 691622 |
too_low_aQual | 0 | 0 |
not_aligned | 0 | 0 |
alignment_not_unique | 5535220 | 2553851 |
Descriptive statistics for non-null counts obtained in the two tophat runs
# consider lines starting with ENSG # take column #2 # ignore count=0 # compute descriptive stats using qstats awk 'BEGIN{FS="\t"; OFS="\t"}{if($1~/ENSG/ && $2>0) print $2}' \ ${outfolder}/chr22-${name}_counts.txt | qstats -s
SRR1039509 | counts for 'all' | counts for 'gtf' |
---|---|---|
Min. | 1 | 1 |
1st Qu. | 5 | 4 |
Median | 78 | 79 |
Mean | 755.246 | 832.739 |
3rd Qu. | 527 | 557 |
Max. | 210019 | 214935 |
Range | 210018 | 214934 |
Std Dev. | 4181.34 | 4554.83 |
Length | 21908 | 21491 |
The data from both methods was used to plot 'all' vs 'gtf' counts. If both methods were identical, all points should be on the red diagonal. The next figure shows that the results are highly similar but not identical and that the 'all' counts presents slightly more variability. Dotted lines at 10 or 20-counts (3.32 or 4.32 in log2) delimit the lower region of low precision in which results differ the most.
compare-htseq_counts.R script
#!/usr/bin/RScript # compare-htseq_counts.R # copy this code to RStudio and adapt file locations to match yours # basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015/SRR1039509_data/tophat2.0.13-SRR1039509_results/htseq_counts" basedir <- <edit here for your machine> setwd(basedir) # Take htseq-count for 'all' and 'gtf' and make scatter-plot # use sample SRR1039509 for plot SRR1039509_all <- read.delim("SRR1039509_all_counts.txt", header=F) SRR1039509_gtf <- read.delim("SRR1039509_gtf_counts.txt", header=F) data <- merge(SRR1039509_all, SRR1039509_gtf, by='V1') colnames(data) <- c("ID", "SRR1039509_all", "SRR1039509_gtf") head(data, 10) # ID SRR1039509_all SRR1039509_gtf # 1 __alignment_not_unique 5535220 2553851 # 2 __ambiguous 567580 691622 # 3 __no_feature 1871251 1212702 # 4 __not_aligned 0 0 # 5 __too_low_aQual 0 0 # 6 ENSG00000000003 434 440 # 7 ENSG00000000005 0 0 # 8 ENSG00000000419 488 506 # 9 ENSG00000000457 226 234 # 10 ENSG00000000460 52 56 # extract summary lines data.summary <- data[grep("^ENS", data$ID, perl=TRUE, invert=TRUE), ] rownames(data.summary) <- data.summary$ID data.summary <- data.all.summary[,-1] data.summary # ID SRR1039509_all SRR1039509_gtf # __alignment_not_unique __alignment_not_unique 5535220 2553851 # __ambiguous __ambiguous 567580 691622 # __no_feature __no_feature 1871251 1212702 # __not_aligned __not_aligned 0 0 # __too_low_aQual __too_low_aQual 0 0 # extract gene data lines data <- data[grep("^ENS", data$ID, perl=TRUE, invert=FALSE), ] head(data) # ID SRR1039509_all SRR1039509_gtf # 6 ENSG00000000003 434 440 # 7 ENSG00000000005 0 0 # 8 ENSG00000000419 488 506 # 9 ENSG00000000457 226 234 # 10 ENSG00000000460 52 56 # 11 ENSG00000000938 0 0 # compute correlation coefficient between both datasets cor(data$SRR1039509_all, data$SRR1039509_gtf, use="complete.obs", method="kendall") # [1] 0.9230123 # correlate with more control cor.test(data$SRR1039509_all, data$SRR1039509_gtf, alternative = "two.sided", method = "kendall", exact = NULL, conf.level = 0.95) # Kendall's rank correlation tau # # data: data$SRR1039509_all and data$SRR1039509_gtf # z = 266.8679, p-value < 2.2e-16 # alternative hypothesis: true tau is not equal to 0 # sample estimates: # tau # 0.9230123 # plot scatterplot png(file="SRR1039509-all_vs_gtf-counts.png") plot(log2(data$SRR1039509_all) ~ log2(data$SRR1039509_gtf), xlab="log2-GTF counts - SRR1039509", ylab="log2-ALL counts - SRR1039509", pch=20, cex=0.5 ) # add lines abline(0, 1, col="red", lty=1, lwd=2) abline(h=log2(10), col="grey1", lty=2) abline(v=log2(10), col="grey1", lty=2) abline(h=log2(20), col="grey2", lty=2) abline(v=log2(20), col="grey2", lty=2) dev.off()
This supports excluding counts lower than a certain value (eg 10) from your analyses, that will introduce more noise that reliable information in your results
Combine individual HTSeq files from the 'all' mapping series
The HTSeq program has extracted counts from each BAM file and created one corresponding text file with column#1 providing ENSG-IDs and column 1 the raw counts. We need to merge all these tables (within each experiment) into a single table with a leading identifier column followed by one labelled column for each replicate/condition. This is easily done using R as detailed below. The final results are saved to two text file (one for 'gtf' and one for 'all' results) that can be fed to differential expression analysis tools.
Merging all HTSeq count files into a R object can be done in different ways. Packages like DESeq have specific functions (see DESeqDataSetFromHTSeqCount) to do the merge directly from the text count files but we prefer to present here a pure R code to aggregate the files that can be recycled to aggregate other text files you may obtain while performing high throughput screens.
All HTSeq results are grouped in a common folder and can be batch loaded into RStudio with the following commands. Please adapt the location to your own directory structure if you reuse this code.
htseq-combine_all.R script
#!/usr/bin/RScript # htseq-combine_all.R # copy this code to RStudio and adapt file locations to match yours # Take 'all' htseq-count results and melt them in to one big dataframe # where are we? basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015" setwd(basedir) cntdir <- paste(basedir, "htseq_counts", sep="/") pat <- "_all_counts.txt" tophat.all <- list.files(path = cntdir, pattern = pat, all.files = TRUE, recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE) # we choose the 'all' series myfiles <- tophat.all DT <- list() # read each file as array element of DT and rename the last 2 cols # we created a list of single sample tables for (i in 1:length(myfiles) ) { infile = paste(cntdir, myfiles[i], sep = "/") DT[[myfiles[i]]] <- read.table(infile, header = F, stringsAsFactors = FALSE) cnts <- gsub("(.*)_all_counts.txt", "\\1", myfiles[i]) colnames(DT[[myfiles[i]]]) <- c("ID", cnts) } # merge all elements based on first ID columns data <- DT[[myfiles[1]]] # inspect head(data) # ID SRR1039508 # 1 ENSG00000000003 667 # 2 ENSG00000000005 0 # 3 ENSG00000000419 430 # 4 ENSG00000000457 256 # 5 ENSG00000000460 56 # 6 ENSG00000000938 0 # we now add each other table with the ID column as key for (i in 2:length(myfiles)) { y <- DT[[myfiles[i]]] z <- merge(data, y, by = c("ID")) data <- z } # ID column becomes rownames rownames(data) <- data$ID data <- data[,-1] ## add total counts per sample data <- rbind(data, tot.counts=colSums(data)) # inspect and look at the top row names! head(data) # SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512 SRR1039513 # __alignment_not_unique 6029848 5535220 6917679 5736110 7174795 3829823 # __ambiguous 609790 567580 563265 582869 711348 449909 # __no_feature 2048439 1871251 2186546 2036064 2564408 1387796 # __not_aligned 0 0 0 0 0 0 # __too_low_aQual 0 0 0 0 0 0 # ENSG00000000003 667 434 721 444 862 401 # SRR1039514 SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 # __alignment_not_unique 8865980 5533574 7349997 7829016 7357178 5435207 # __ambiguous 1021469 607559 688966 890454 809693 575376 # __no_feature 3464421 2230297 2530061 3020635 2678110 1990138 # __not_aligned 0 0 0 0 0 0 # __too_low_aQual 0 0 0 0 0 0 # ENSG00000000003 958 853 1133 1050 1087 1104 # SRR1039520 SRR1039521 SRR1039522 SRR1039523 # __alignment_not_unique 5019316 5159442 6941580 7078894 # __ambiguous 554414 613151 709211 818199 # __no_feature 1939908 1931496 2641504 2825142 # __not_aligned 0 0 0 0 # __too_low_aQual 0 0 0 0 # ENSG00000000003 750 562 1075 826 tail(data) # SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512 SRR1039513 SRR1039514 # ENSG00000273487 3 4 4 6 0 9 5 # ENSG00000273488 5 3 4 3 5 3 10 # ENSG00000273489 0 0 1 0 0 2 0 # ENSG00000273492 0 0 0 0 1 0 2 # ENSG00000273493 0 0 0 0 0 0 0 # tot.counts 26792115 24519985 27416297 25783685 33081391 19381676 45865162 # SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 SRR1039520 SRR1039521 # ENSG00000273487 4 6 4 5 3 3 8 # ENSG00000273488 4 3 10 6 6 6 11 # ENSG00000273489 2 1 0 2 1 0 0 # ENSG00000273492 1 0 0 1 0 0 0 # ENSG00000273493 0 0 0 0 0 0 0 # tot.counts 27862053 32316085 39563928 35371127 25763223 24653719 26874854 # SRR1039522 SRR1039523 # ENSG00000273487 6 9 # ENSG00000273488 7 6 # ENSG00000273489 3 1 # ENSG00000273492 1 0 # ENSG00000273493 0 1 # tot.counts 32888531 35869342 #################################### # take summary rows to a new table # ( not starting with ENS with invert=TRUE ) # transpose table for readability data.all.summary <- data[grep("^ENS", rownames(data), perl=TRUE, invert=TRUE), ] # review data.all.summary # SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512 SRR1039513 # __alignment_not_unique 6029848 5535220 6917679 5736110 7174795 3829823 # __ambiguous 609790 567580 563265 582869 711348 449909 # __no_feature 2048439 1871251 2186546 2036064 2564408 1387796 # __not_aligned 0 0 0 0 0 0 # __too_low_aQual 0 0 0 0 0 0 # tot.counts 26792115 24519985 27416297 25783685 33081391 19381676 # SRR1039514 SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 # __alignment_not_unique 8865980 5533574 7349997 7829016 7357178 5435207 # __ambiguous 1021469 607559 688966 890454 809693 575376 # __no_feature 3464421 2230297 2530061 3020635 2678110 1990138 # __not_aligned 0 0 0 0 0 0 # __too_low_aQual 0 0 0 0 0 0 # tot.counts 45865162 27862053 32316085 39563928 35371127 25763223 # SRR1039520 SRR1039521 SRR1039522 SRR1039523 # __alignment_not_unique 5019316 5159442 6941580 7078894 # __ambiguous 554414 613151 709211 818199 # __no_feature 1939908 1931496 2641504 2825142 # __not_aligned 0 0 0 0 # __too_low_aQual 0 0 0 0 # tot.counts 24653719 26874854 32888531 35869342 # transpose table t(data.all.summary) # write summary to file write.csv(data.all.summary, file = "chr22-htseq_counts_all-summary.csv") #################################### # take all data rows to a new table data.all <- data[grep("^ENS", rownames(data), perl=TRUE, invert=FALSE), ] # final merged table head(data.all, 3) # write data to file write.csv(data.all, file = "chr22-htseq_counts_all.csv") # cleanup intermediate objects rm(y, z, i, DT)
The resulting 16-column by 57773 rows data table starts like shown below
# ENSG00000000003 667 434 721 444 862 401 958
# ENSG00000000005 0 0 0 0 0 0 1
# ENSG00000000419 430 488 417 477 556 334 796
# SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 SRR1039520 SRR1039521
# ENSG00000000003 853 1133 1050 1087 1104 750 562
# ENSG00000000005 0 0 0 1 0 0 0
# ENSG00000000419 409 529 719 669 465 378 468
# SRR1039522 SRR1039523
# ENSG00000000003 1075 826
# ENSG00000000005 0 0
# ENSG00000000419 487 622
A comparable script can be used if you have computed counts from gtf tophat mappings
htseq-combine_gtf.R script
#!/usr/bin/RScript # htseq-combine_gtf.R # copy this code to RStudio and adapt file locations to match yours # Take 'gtf' htseq-count results and melt them in to one big dataframe # where are we? basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015" setwd(basedir) cntdir <- paste(basedir, "htseq_counts", sep="/") pat <- "_gtf_counts.txt" tophat.gtf <- list.files(path = cntdir, pattern = pat, gtf.files = TRUE, recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE) # we choose the 'gtf' series myfiles <- tophat.gtf DT <- list() # read each file as array element of DT and rename the last 2 cols # we created a list of single sample tables for (i in 1:length(myfiles) ) { infile = paste(cntdir, myfiles[i], sep = "/") DT[[myfiles[i]]] <- read.table(infile, header = F, stringsAsFactors = FALSE) cnts <- gsub("(.*)_gtf_counts.txt", "\\1", myfiles[i]) colnames(DT[[myfiles[i]]]) <- c("ID", cnts) } # merge gtf elements based on first ID columns data <- DT[[myfiles[1]]] # inspect head(data) # we now add each other table with the ID column as key for (i in 2:length(myfiles)) { y <- DT[[myfiles[i]]] z <- merge(data, y, by = c("ID")) data <- z } # ID column becomes rownames rownames(data) <- data$ID data <- data[,-1] ## add total counts per sample data <- rbind(data, tot.counts=colSums(data)) # inspect and look at the top row names! head(data) tail(data) #################################### # take summary rows to a new table # ( not starting with ENS with invert=TRUE ) # transpose table for readability data.gtf.summary <- data[grep("^ENS", rownames(data), perl=TRUE, invert=TRUE), ] # review data.gtf.summary # transpose table t(data.gtf.summary) # write summary to file write.csv(data.gtf.summary, file = "chr22-htseq_counts_gtf-summary.csv") #################################### # take gtf data rows to a new table data.gtf <- data[grep("^ENS", rownames(data), perl=TRUE, invert=FALSE), ] # final merged table head(data.gtf, 3) # write data to file write.csv(data.gtf, file = "chr22-htseq_counts_gtf.csv") # cleanup intermediate objects rm(y, z, i, DT)
Create a table describing the samples
A metadata table that will be precious for the next exercises was also created with a direct download of online resources as detailed below.
create_metadata-table.R script
#!/usr/bin/RScript # create_metadata-table.R ## create a metadata table that will be used later # this table will contain information required to split the samples # into groups and define contrasts for the differential expression analysis. # You may build such table using your favorite spreadsheet editor or # let R make it for you as shown below # this code is saved under 'build-metadata.R' library("stringr") PID <- "SRP033351" ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=", PID, "&result=read_run", "&fields=run_accession,read_count", sep="") # assemble into a data.frame metadata <- as.data.frame(read.table(url(ena.url), header=TRUE, sep="\t")) # add missing information about cells & treatment samples=c("N61311_untreated", "N61311_Dex", "N61311_Alb", "N61311_Alb_Dex", "N052611_untreated", "N052611_Dex", "N052611_Alb", "N052611_Alb_Dex", "N080611_untreated", "N080611_Dex", "N080611_Alb", "N080611_Alb_Dex", "N061011_untreated", "N061011_Dex", "N061011_Alb", "N061011_Alb_Dex") metadata$samples <- samples # sample the first line t(metadata[1,]) # add new columns metadata$cells <- str_extract(metadata$sample, "\\b[A-Za-z0-9]+") metadata$treatment <- mapply(sub,paste(metadata$cells,"_",sep=""), "", metadata$sample) # save to file for reuse write.table(metadata, file="GSE52778_metadata.txt", row.names=F, sep="\t", quote=FALSE) # view table metadata
1 SRR1039508 22935521 N61311_untreated N61311 untreated
2 SRR1039509 21155707 N61311_Dex N61311 Dex
3 SRR1039510 22852619 N61311_Alb N61311 Alb
4 SRR1039511 21938637 N61311_Alb_Dex N61311 Alb_Dex
5 SRR1039512 28136282 N052611_untreated N052611 untreated
6 SRR1039513 43356464 N052611_Dex N052611 Dex
7 SRR1039514 40021727 N052611_Alb N052611 Alb
8 SRR1039515 29454748 N052611_Alb_Dex N052611 Alb_Dex
9 SRR1039516 30043024 N080611_untreated N080611 untreated
10 SRR1039517 34298260 N080611_Dex N080611 Dex
11 SRR1039518 30441200 N080611_Alb N080611 Alb
12 SRR1039519 31533740 N080611_Alb_Dex N080611 Alb_Dex
13 SRR1039520 34575286 N061011_untreated N061011 untreated
14 SRR1039521 41152075 N061011_Dex N061011 Dex
15 SRR1039522 28406812 N061011_Alb N061011 Alb
16 SRR1039523 31099538 N061011_Alb_Dex N061011 Alb_Dex
download exercise files
Download exercise files here.
References:
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.3 | NGS_RNASeq_DE_Exercise.5 ]