NGS RNASeq DE Exercise.8
Wrap-Up comparison of RNASeq results obtained in the different exercises
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.7 ]
Contents
Compare significant DE-calls from three commonly used packages
One obvious comparison can be made by taking the significant hits from one caller (corr-pVal<0.1) and plot their logFC value against that obtained with a second caller. If the callers are identical, the dots should fall on the diagonal. We show here pairwise comparisons between calls from DESeq2 (the most common and stringent caller use today) and results obtained in NGS_RNASeq_DE_Exercise.7 using the easy RobiNA software (DESeq1 and EdgeR).
R-code to plot the comparison script
#!/usr/bin/RScript # compare-DESeq_LFC.R # copy this code to RStudio and adapt file locations to match yours basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015" setwd(basedir) # load RobiNA results from files robina.deseq.file <- paste(basedir, "RobiNA/RobiNA_Deseq1rev/detailed_results/full_table_untr-dex.txt", sep="/") robina.deseq1 <- read.delim(robina.deseq.file, stringsAsFactors=FALSE) robina.deseq1 <- robina.deseq1[order(robina.deseq1$padj),] rownames(robina.deseq1) <- NULL head(robina.deseq1) robina.edger.file <- paste(basedir, "RobiNA/RobiNA_EdgeR/detailed_results/full_table_dex-untr.txt", sep="/") robina.edger <- read.delim(robina.edger.file, stringsAsFactors=FALSE) robina.edger <- robina.edger[order(robina.edger$FDR),] rownames(robina.edger) <- NULL colnames(robina.edger)[1] <- "id" head(robina.edger) # load DESeq2 results from files deseq2.file <- "Dex_vs_untreated_results.csv" deseq2_results <- read.csv(deseq2.file, stringsAsFactors=FALSE) deseq2_results <- deseq2_results[order(deseq2_results$padj),] colnames(deseq2_results)[1] <- "id" head(deseq2_results) # merge DESeq merged.results <- merge(robina.deseq1, robina.edger, by='id', all = TRUE) merged.results <- merge(merged.results, deseq2_results, by='id', all = TRUE) # plot scatterplot DESeq2 vs DESeq1 png(file="DESeq2-vs_1_log2FoldChange.png") # order by increasing DESeq2 padj merged.results <- merged.results[order(merged.results$padj.y),] top.res <- merged.results[which(merged.results$padj.y<0.1),] plot(top.res$log2FoldChange.x, top.res$log2FoldChange.y, xlab="DESeq1 log2FoldChange - GSE52778", ylab="DESeq2 log2FoldChange - SRP033351", xlim=c(-4,4), ylim=c(-4,4), pch=20, cex=0.5, main="DESeq2 results with padj<0.1" ) # add red color for the top 100 hits top.100 <- head(merged.results, 100) points(top.100$log2FoldChange.x, top.100$log2FoldChange.y, col="red", lwd=2, pch=21) # add lines abline(0, 1, col="green", lty=1, lwd=2) dev.off() # plot scatterplot DESeq1 vs DESeq2 png(file="DESeq1-vs_2_log2FoldChange.png") # order by increasing DESeq1 padj merged.results <- merged.results[order(merged.results$padj.x),] top.res <- merged.results[which(merged.results$padj.y<0.1),] plot(top.res$log2FoldChange.y, top.res$log2FoldChange.x, xlab="DESeq2 log2FoldChange - SRP033351", ylab="DESeq1 log2FoldChange - GSE52778", xlim=c(-4,4), ylim=c(-4,4), pch=20, cex=0.5, main="DESeq1 results with padj<0.1" ) # add red color for the top 100 hits top.100 <- head(merged.results, 100) points(top.100$log2FoldChange.y, top.100$log2FoldChange.x, col="red", lwd=2, pch=21) # add lines abline(0, 1, col="green", lty=1, lwd=2) dev.off() # plot scatterplot DESeq1 vs EdgeR png(file="DESeq1-vs_EdgeR_log2FoldChange.png") # re-sort by increasing DESeq1 hits merged.results <- merged.results[order(merged.results$padj.x),] top.res <- merged.results[which(merged.results$padj.x<0.1),] plot(top.res$logFC, top.res$log2FoldChange.x, xlab="EdgeR log2FoldChange - GSE52778", ylab="DESeq1 log2FoldChange - GSE52778", xlim=c(-4,4), ylim=c(-4,4), pch=20, cex=0.5, main="DESeq1 results with padj<0.1" ) # add red color for th etop 100 hits top.100 <- head(merged.results, 100) points(top.100$logFC, top.100$log2FoldChange.x, col="red", lwd=2, pch=21) # add lines abline(0, 1, col="green", lty=1, lwd=2) dev.off() # plot scatterplot EdgeR vs DESeq1 png(file="EdgeR-vs_DESeq1_log2FoldChange.png") # re-sort by increasing EdgeR hits merged.results <- merged.results[order(merged.results$FDR),] top.res <- merged.results[which(merged.results$FDR<0.1),] plot(top.res$logFC, top.res$log2FoldChange.x, xlab="DESeq1 log2FoldChange - GSE52778", ylab="EdgeR log2FoldChange - GSE52778", xlim=c(-4,4), ylim=c(-4,4), pch=20, cex=0.5, main="EdgeR results with FDR<0.1" ) # add red color for th etop 100 hits top.100 <- head(merged.results, 100) points(top.100$log2FoldChange.x, top.100$logFC, col="red", lwd=2, pch=21) # add lines abline(0, 1, col="green", lty=1, lwd=2) dev.off()
Legend:
- Black dots are DE-results filtered at padj|FDR<0.1 with the most stringent caller
- Red dots are the top 100 genes after sorting by increasing padj|FDR
Compare results obtained with DESeq2 and EdgeR workflows
R-code to plot DESeq2 vs EdgeR results from ex5 and ex6 script
# compare-DESeq2-EdgeR_LFC.R # copy this code to RStudio and adapt file locations to match yours basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015" # basedir <- "/media/bits/RNASeq_DATA/Final_results/R_results" setwd(basedir) # load DESeq2 results from files deseq2.file <- "Dex_vs_untreated_results.csv" deseq2_results <- read.csv(deseq2.file, stringsAsFactors=FALSE) deseq2_results <- deseq2_results[order(deseq2_results$padj),] colnames(deseq2_results)[1] <- "id" head(deseq2_results) # id baseMean log2FoldChange lfcSE stat pvalue padj # 1 ENSG00000165995 514.2841 3.321662 0.1307366 25.40728 2.095444e-142 3.170407e-138 # 2 ENSG00000152583 985.5593 4.340812 0.1760858 24.65169 3.529480e-134 2.670051e-130 # 3 ENSG00000120129 3325.4027 2.873150 0.1167734 24.60448 1.131211e-133 5.705074e-130 # 4 ENSG00000101347 13616.9348 3.606558 0.1517550 23.76566 7.569343e-125 2.863104e-121 # 5 ENSG00000189221 2294.7300 3.231983 0.1396172 23.14888 1.491976e-118 4.514719e-115 # 6 ENSG00000211445 12162.4869 3.540681 0.1573440 22.50281 3.895921e-112 9.824214e-109 # load EdgeR results from files edger.file <- "EdgeR-Dex_vs_untreated_results.csv" edger_results <- read.csv(edger.file, stringsAsFactors=FALSE) edger_results <- edger_results[order(edger_results$padj),] colnames(edger_results)[1] <- "id" head(edger_results) # id logFC logCPM LR PValue padj # 1 ENSG00000109906 7.330497 4.209885 1314.6167 7.530502e-288 4.350597e-283 # 2 ENSG00000165995 3.420225 4.710180 815.1903 2.688130e-179 7.765066e-175 # 3 ENSG00000152583 4.593225 5.656511 703.0385 6.530830e-155 1.257686e-150 # 4 ENSG00000171819 5.809612 3.627514 647.4904 7.850693e-143 1.133895e-138 # 5 ENSG00000163884 4.451690 4.806375 587.1334 1.052997e-129 1.216696e-125 # 6 ENSG00000101347 3.745477 9.432151 578.3239 8.682964e-128 8.360681e-124 # merge DESeq merged.results <- merge(deseq2_results, edger_results, by='id', all = TRUE) # plot scatterplot DESeq2 vs DESeq1 # order by increasing DESeq2 padj merged.results <- merged.results[order(merged.results$padj.x),] top.res <- merged.results[which(merged.results$padj.x<0.1),] plot(top.res$logFC, top.res$log2FoldChange, xlab="EdgeR logFC - SRP033351", ylab="DESeq2 log2FoldChange - SRP033351", xlim=c(-4,4), ylim=c(-4,4), pch=20, cex=0.5, main="DESeq2 results with padj<0.1" ) # add red color for the top 100 hits top.100 <- head(merged.results, 100) points(top.100$logFC, top.100$log2FoldChange, col="red", lwd=2, pch=20) # add lines abline(0, 1, col="green", lty=1, lwd=2)
Compare results obtained with DESeq2 from 'all' and 'gtf' counts
We wish here to control wether some genes would appear differentially expressed when comparing HTSeq counts obtained from untreated samples and the two Tophat workflows. The hypothesis is that mapping against the full genome or against the exomic part only (GTF file) may introduce a significant bias in the gene counts.
• Use ‘all’ and ‘gtf’ tophat2 + htseq-count results for all ‘untreated’ samples (2x4) • Melt the count tables into one dataframe • Compute group log2-means and T-test between the groups and correct for multiple testing • Identify genes-counts with a T-test result lower than (0.1 & 0.05) and plot their log2-mean in each group.
R-code to compare HTSeq counts from 'gtf' and 'all' mappings script
## copy thsi code into RStudio to reproduce the results library("knitr") # where are we? basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015" cntdir <- paste(basedir, "htseq_counts", sep="/") setwd(basedir) # build file list pref <- c("SRR1039508", "SRR1039512", "SRR1039516", "SRR1039520") suff <- c("_all_counts.txt", "_gtf_counts.txt") myfiles <- apply(expand.grid(pref, suff), 1, paste, collapse="") # initialize DT list DT <- list() # read each file as array element of DT and rename the last 2 cols # we created a list of single sample tables for (i in 1:length(myfiles) ) { infile = paste(cntdir, myfiles[i], sep = "/") DT[[myfiles[i]]] <- read.table(infile, header = F, stringsAsFactors = FALSE) cnts <- gsub("(.*)_counts.txt", "\\1", myfiles[i]) colnames(DT[[myfiles[i]]]) <- c("ID", cnts) } # merge all elements based on first ID columns data <- DT[[myfiles[1]]] # inspect head(data) # we now add each other table with the ID column as key for (i in 2:length(myfiles)) { y <- DT[[myfiles[i]]] z <- merge(data, y, by = c("ID")) data <- z } # ID column becomes rownames rownames(data) <- data$ID data <- data[,-1] ## add total counts per sample data <- rbind(data, tot.counts=colSums(data)) # inspect and look at the top row names! head(data, 6) tail(data, 3) #################################### # take summary data to a new table # ( not starting with ENS with invert=TRUE ) # transpose table for readability data.untr.summary <- data[grep("^ENS", rownames(data), perl=TRUE, invert=TRUE), ] # review data.untr.summary # transpose table t(data.untr.summary) # write summary to file write.csv(data.untr.summary, file = "htseq_counts_untreated-all_gtf-summary.csv") kable(t(data.untr.summary[c(1:3,6),]), format="pandoc") #################################### # take count data to a new table data.untr <- data[grep("^ENS", rownames(data), perl=TRUE, invert=FALSE), ] # final merged table head(data.untr, 3) # cleanup intermediate objects rm(y, z, i, DT, data) ## We then define the two groups and compute descriptive statistics using R built-in functions. # groups all.grp <- grep( "_all", colnames(data.untr) ) gtf.grp <- grep( "_gtf", colnames(data.untr) ) # compute mean for each group t.test.p.value <- function(...) { obj<-try(t.test(...), silent=TRUE) if (is(obj, "try-error")) return(NA) else return(obj$p.value) } foo <- function(x, digits = 2L, ...) { x <- c(x, recursive = TRUE, use.names = FALSE) res <- c( gtfmean = mean(x[gtf.grp], ...), allmean = mean(x[all.grp], ...), logfc = log( (mean(x[gtf.grp], ...)+1) / (mean(x[all.grp], ...)+1), 2), p.value = t.test.p.value(x[gtf.grp], x[all.grp], alternative="two.sided", paired = FALSE) ) } # add stats to raw data data.untr <- cbind( data.untr, t(apply(data.untr, 1, foo, na.rm = TRUE)) ) # add adj.pval data.untr <- cbind( data.untr, padj = p.adjust(data.untr$p.value, method="BH") ) # final merged table head(data.untr, 3) # write data to file write.csv(data.untr, file = "htseq_counts_untreated-all_gtf-counts.csv") ## To conclude about the difference between full genome mapping (all) and exon-mapping (gtf), we use the mean signal (log2-transformed) obtained for each group of 4 cell untreated cell lines in a scatter plot. # significant subsets cutoff1 <- 0.1 sigres.01 <- subset(data.untr, data.untr$padj < cutoff1) # significant genes with T-test < cutoff1 table(data.untr$padj < cutoff1) cutoff2 <- 0.05 sigres.005 <- subset(data.untr, data.untr$padj < cutoff2) # significant genes with T-test < cutoff2 table(data.untr$padj < cutoff2) title <- "Counts significantly different\nbetween 'untreated'-groups ('all' vs 'gtf')" plot( log(data.untr$gtfmean, 2), log(data.untr$allmean, 2), xlim=c(0,12), ylim=c(0,12), xlab="log2-mean gtf-counts", ylab="log2-mean all-counts", main=title, col="green1", cex=0.25, pch=20) points( log(sigres.01$gtfmean, 2), log(sigres.01$allmean, 2), col="red1", cex=0.5, pch=20) points( log(sigres.005$gtfmean, 2), log(sigres.005$allmean, 2), col="black", cex=1, pch=20) # add lines abline(0, 1, col="blue1", lty=1, lwd=2)
sample | alignment.not.unique | ambiguous | no.feature | tot.counts |
---|---|---|---|---|
SRR1039508_all | 6029848 | 609790 | 2048439 | 26792115 |
SRR1039512_all | 7174795 | 711348 | 2564408 | 33081391 |
SRR1039516_all | 7349997 | 688966 | 2530061 | 32316085 |
SRR1039520_all | 5019316 | 554414 | 1939908 | 24653719 |
SRR1039508_gtf | 2841152 | 743579 | 1330707 | 24488909 |
SRR1039512_gtf | 3427386 | 859616 | 1741674 | 30345849 |
SRR1039516_gtf | 3596078 | 855296 | 1744829 | 29551825 |
SRR1039520_gtf | 2387593 | 656458 | 1347731 | 22766817 |
Investigating the read coverage across these genes would probably reveal mapping biases influenced by the non-exomic part of the human reference in the case of the 'all' mapping. The count of 'not.unique' mappings is higher in the 'all' HTSeq data while more 'ambiguous' mappings are reported in the 'gtf' HTSeq data.
The scatter-plot of gene count means reveals biases between the two workflows and a random distribution of pseudo-significant counts in the whole expression range.
Legend:
- green dots represent mean values of genes-counts not significantly different between the tophat 'all' and 'gtf' sample groups
- red dots had T-test results below 0.1
- black dots obtained a T-test value lower than 0.05
We see in this final graph that a number of genes appear differentially expressed when comparing HTSeq counts obtained from the two tophat2 workflows applied to the same original reads.
Closing Remarks
- Most significant calls from one caller remain significant with a second caller, which is quire reassuring and fortunate.
- DESeq2 leads to different results as compared to DESeq1 or EdgeR, especially for extreme calls
- When EdgeR is used as significance filter, a number of black dots appear that are not confirmed by DESeq (EdgeR is known to be less stringent and call false-positive DE genes).
- The choice of the Tophat reference (full genome or exome) has some effect on the counts in a given number of genes and may bias the differential expression analysis.
download exercise files
Download exercise files here.
References:
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.7 ]