NGS RNASeq DE Exercise.8

From BITS wiki
Jump to: navigation, search

Wrap-Up comparison of RNASeq results obtained in the different exercises

nutshell.png

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.7 ]


Compare significant DE-calls from three commonly used packages

One obvious comparison can be made by taking the significant hits from one caller (corr-pVal<0.1) and plot their logFC value against that obtained with a second caller. If the callers are identical, the dots should fall on the diagonal. We show here pairwise comparisons between calls from DESeq2 (the most common and stringent caller use today) and results obtained in NGS_RNASeq_DE_Exercise.7 using the easy RobiNA software (DESeq1 and EdgeR).

R-code to plot the comparison script

#!/usr/bin/RScript
# compare-DESeq_LFC.R
# copy this code to RStudio and adapt file locations to match yours
 
basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015"
setwd(basedir)
 
# load RobiNA results from files
robina.deseq.file <- paste(basedir, 
      "RobiNA/RobiNA_Deseq1rev/detailed_results/full_table_untr-dex.txt", sep="/")
robina.deseq1 <- read.delim(robina.deseq.file, stringsAsFactors=FALSE)
robina.deseq1 <- robina.deseq1[order(robina.deseq1$padj),]
rownames(robina.deseq1) <- NULL
head(robina.deseq1)
 
robina.edger.file <- paste(basedir, 
      "RobiNA/RobiNA_EdgeR/detailed_results/full_table_dex-untr.txt", sep="/")
robina.edger <- read.delim(robina.edger.file, stringsAsFactors=FALSE)
robina.edger <- robina.edger[order(robina.edger$FDR),]
rownames(robina.edger) <- NULL
colnames(robina.edger)[1] <- "id"
head(robina.edger)
 
# load DESeq2 results from files
deseq2.file <- "Dex_vs_untreated_results.csv"
deseq2_results <- read.csv(deseq2.file, stringsAsFactors=FALSE)
deseq2_results <- deseq2_results[order(deseq2_results$padj),]
colnames(deseq2_results)[1] <- "id"
head(deseq2_results)
 
# merge DESeq
merged.results <- merge(robina.deseq1, robina.edger, by='id', all = TRUE) 
merged.results <- merge(merged.results, deseq2_results, by='id', all = TRUE) 
 
# plot scatterplot DESeq2 vs DESeq1
png(file="DESeq2-vs_1_log2FoldChange.png")
 
# order by increasing DESeq2 padj
merged.results <- merged.results[order(merged.results$padj.y),]
 
top.res <- merged.results[which(merged.results$padj.y<0.1),]
plot(top.res$log2FoldChange.x, top.res$log2FoldChange.y,
     xlab="DESeq1 log2FoldChange - GSE52778",
     ylab="DESeq2 log2FoldChange - SRP033351",
     xlim=c(-4,4),
     ylim=c(-4,4),
     pch=20,
     cex=0.5,
     main="DESeq2 results with padj<0.1"
  )
 
# add red color for the top 100 hits
top.100 <- head(merged.results, 100)
points(top.100$log2FoldChange.x, top.100$log2FoldChange.y,
       col="red", lwd=2, pch=21)
# add lines
abline(0, 1, col="green", lty=1, lwd=2)
dev.off()
 
# plot scatterplot DESeq1 vs DESeq2
png(file="DESeq1-vs_2_log2FoldChange.png")
 
# order by increasing DESeq1 padj
merged.results <- merged.results[order(merged.results$padj.x),]
 
top.res <- merged.results[which(merged.results$padj.y<0.1),]
plot(top.res$log2FoldChange.y, top.res$log2FoldChange.x,
     xlab="DESeq2 log2FoldChange - SRP033351",
     ylab="DESeq1 log2FoldChange - GSE52778",
     xlim=c(-4,4),
     ylim=c(-4,4),
     pch=20,
     cex=0.5,
     main="DESeq1 results with padj<0.1"
)
 
# add red color for the top 100 hits
top.100 <- head(merged.results, 100)
points(top.100$log2FoldChange.y, top.100$log2FoldChange.x,
       col="red", lwd=2, pch=21)
# add lines
abline(0, 1, col="green", lty=1, lwd=2)
dev.off()
 
# plot scatterplot DESeq1 vs EdgeR
png(file="DESeq1-vs_EdgeR_log2FoldChange.png")
 
# re-sort by increasing DESeq1 hits
merged.results <- merged.results[order(merged.results$padj.x),]
 
top.res <- merged.results[which(merged.results$padj.x<0.1),]
plot(top.res$logFC, top.res$log2FoldChange.x,
     xlab="EdgeR log2FoldChange - GSE52778",
     ylab="DESeq1 log2FoldChange - GSE52778",
     xlim=c(-4,4),
     ylim=c(-4,4),
     pch=20,
     cex=0.5,
     main="DESeq1 results with padj<0.1"
)
 
# add red color for th etop 100 hits
top.100 <- head(merged.results, 100)
points(top.100$logFC, top.100$log2FoldChange.x,
       col="red", lwd=2, pch=21)
# add lines
abline(0, 1, col="green", lty=1, lwd=2)
dev.off()
 
# plot scatterplot EdgeR vs DESeq1 
png(file="EdgeR-vs_DESeq1_log2FoldChange.png")
 
# re-sort by increasing EdgeR hits
merged.results <- merged.results[order(merged.results$FDR),]
 
top.res <- merged.results[which(merged.results$FDR<0.1),]
plot(top.res$logFC, top.res$log2FoldChange.x,
     xlab="DESeq1 log2FoldChange - GSE52778",
     ylab="EdgeR log2FoldChange - GSE52778",
     xlim=c(-4,4),
     ylim=c(-4,4),
     pch=20,
     cex=0.5,
     main="EdgeR results with FDR<0.1"
)
 
# add red color for th etop 100 hits
top.100 <- head(merged.results, 100)
points(top.100$log2FoldChange.x, top.100$logFC, 
       col="red", lwd=2, pch=21)
# add lines
abline(0, 1, col="green", lty=1, lwd=2)
dev.off()
DESeq1-vs_EdgeR_log2FoldChange.png
DESeq2-vs_1_log2FoldChange.png
EdgeR-vs_DESeq1_log2FoldChange.png
DESeq1-vs_2_log2FoldChange.png

Legend:

  • Black dots are DE-results filtered at padj|FDR<0.1 with the most stringent caller
  • Red dots are the top 100 genes after sorting by increasing padj|FDR

 

Compare results obtained with DESeq2 and EdgeR workflows

R-code to plot DESeq2 vs EdgeR results from ex5 and ex6 script

# compare-DESeq2-EdgeR_LFC.R
# copy this code to RStudio and adapt file locations to match yours
 
basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015"
# basedir <- "/media/bits/RNASeq_DATA/Final_results/R_results"
setwd(basedir)
 
# load DESeq2 results from files
deseq2.file <- "Dex_vs_untreated_results.csv"
deseq2_results <- read.csv(deseq2.file, stringsAsFactors=FALSE)
deseq2_results <- deseq2_results[order(deseq2_results$padj),]
colnames(deseq2_results)[1] <- "id"
head(deseq2_results)
#                id   baseMean log2FoldChange     lfcSE     stat        pvalue          padj
# 1 ENSG00000165995   514.2841       3.321662 0.1307366 25.40728 2.095444e-142 3.170407e-138
# 2 ENSG00000152583   985.5593       4.340812 0.1760858 24.65169 3.529480e-134 2.670051e-130
# 3 ENSG00000120129  3325.4027       2.873150 0.1167734 24.60448 1.131211e-133 5.705074e-130
# 4 ENSG00000101347 13616.9348       3.606558 0.1517550 23.76566 7.569343e-125 2.863104e-121
# 5 ENSG00000189221  2294.7300       3.231983 0.1396172 23.14888 1.491976e-118 4.514719e-115
# 6 ENSG00000211445 12162.4869       3.540681 0.1573440 22.50281 3.895921e-112 9.824214e-109
 
# load EdgeR results from files
edger.file <- "EdgeR-Dex_vs_untreated_results.csv"
edger_results <- read.csv(edger.file, stringsAsFactors=FALSE)
edger_results <- edger_results[order(edger_results$padj),]
colnames(edger_results)[1] <- "id"
head(edger_results)
#                id    logFC   logCPM        LR        PValue          padj
# 1 ENSG00000109906 7.330497 4.209885 1314.6167 7.530502e-288 4.350597e-283
# 2 ENSG00000165995 3.420225 4.710180  815.1903 2.688130e-179 7.765066e-175
# 3 ENSG00000152583 4.593225 5.656511  703.0385 6.530830e-155 1.257686e-150
# 4 ENSG00000171819 5.809612 3.627514  647.4904 7.850693e-143 1.133895e-138
# 5 ENSG00000163884 4.451690 4.806375  587.1334 1.052997e-129 1.216696e-125
# 6 ENSG00000101347 3.745477 9.432151  578.3239 8.682964e-128 8.360681e-124
 
# merge DESeq
merged.results <- merge(deseq2_results, edger_results, by='id', all = TRUE) 
 
# plot scatterplot DESeq2 vs DESeq1
 
# order by increasing DESeq2 padj
merged.results <- merged.results[order(merged.results$padj.x),]
 
top.res <- merged.results[which(merged.results$padj.x<0.1),]
plot(top.res$logFC, top.res$log2FoldChange,  
     xlab="EdgeR logFC - SRP033351",
     ylab="DESeq2 log2FoldChange - SRP033351",
     xlim=c(-4,4),
     ylim=c(-4,4),
     pch=20,
     cex=0.5,
     main="DESeq2 results with padj<0.1"
  )
 
# add red color for the top 100 hits
top.100 <- head(merged.results, 100)
points(top.100$logFC, top.100$log2FoldChange, 
       col="red", lwd=2, pch=20)
# add lines
abline(0, 1, col="green", lty=1, lwd=2)


DESeq2_vs_EdgeR.png

 

Compare results obtained with DESeq2 from 'all' and 'gtf' counts

We wish here to control wether some genes would appear differentially expressed when comparing HTSeq counts obtained from untreated samples and the two Tophat workflows. The hypothesis is that mapping against the full genome or against the exomic part only (GTF file) may introduce a significant bias in the gene counts.

• Use ‘all’ and ‘gtf’ tophat2 + htseq-count results for all ‘untreated’ samples (2x4) • Melt the count tables into one dataframe • Compute group log2-means and T-test between the groups and correct for multiple testing • Identify genes-counts with a T-test result lower than (0.1 & 0.05) and plot their log2-mean in each group.


R-code to compare HTSeq counts from 'gtf' and 'all' mappings script

## copy thsi code into RStudio to reproduce the results
library("knitr")
 
# where are we?
basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015"
cntdir <- paste(basedir, "htseq_counts", sep="/")
setwd(basedir)
 
# build file list
pref <- c("SRR1039508", "SRR1039512", "SRR1039516", "SRR1039520")
suff <- c("_all_counts.txt", "_gtf_counts.txt")
myfiles <- apply(expand.grid(pref, suff), 1, paste, collapse="")
 
# initialize DT list
DT <- list()
 
# read each file as array element of DT and rename the last 2 cols
# we created a list of single sample tables
for (i in 1:length(myfiles) ) {
  infile = paste(cntdir, myfiles[i], sep = "/")
 	DT[[myfiles[i]]] <- read.table(infile, header = F, stringsAsFactors = FALSE)
	cnts <- gsub("(.*)_counts.txt", "\\1", myfiles[i])
	colnames(DT[[myfiles[i]]]) <- c("ID", cnts)
}
 
# merge all elements based on first ID columns
data <- DT[[myfiles[1]]]
 
# inspect
head(data)
 
# we now add each other table with the ID column as key
for (i in 2:length(myfiles)) {
	y <- DT[[myfiles[i]]]
	z <- merge(data, y, by = c("ID"))
	data <- z
}
 
# ID column becomes rownames
rownames(data) <- data$ID
data <- data[,-1]
 
## add total counts per sample
data <- rbind(data, tot.counts=colSums(data))
 
# inspect and look at the top row names!
head(data, 6)
tail(data, 3)
 
####################################
# take summary data to a new table
# ( not starting with ENS with invert=TRUE )
 
# transpose table for readability
data.untr.summary <- data[grep("^ENS", rownames(data), perl=TRUE, invert=TRUE), ]
 
# review
data.untr.summary
 
# transpose table
t(data.untr.summary)
 
# write summary to file
write.csv(data.untr.summary, file = "htseq_counts_untreated-all_gtf-summary.csv")
 
kable(t(data.untr.summary[c(1:3,6),]), format="pandoc")
 
####################################
# take count data to a new table
 
data.untr <- data[grep("^ENS", rownames(data), perl=TRUE, invert=FALSE), ]
 
# final merged table
head(data.untr, 3)
 
# cleanup intermediate objects
rm(y, z, i, DT, data)
 
## We then define the two groups and compute descriptive statistics using R built-in functions.
 
# groups
all.grp <- grep( "_all", colnames(data.untr) )
gtf.grp <- grep( "_gtf", colnames(data.untr) )
 
# compute mean for each group
t.test.p.value <- function(...) {
  obj<-try(t.test(...), silent=TRUE)
  if (is(obj, "try-error")) return(NA) else return(obj$p.value)
}
 
foo <- function(x, digits = 2L, ...) {
  x <- c(x, recursive = TRUE, use.names = FALSE)
  res <- c( gtfmean = mean(x[gtf.grp], ...),
            allmean = mean(x[all.grp], ...), 
            logfc = log( (mean(x[gtf.grp], ...)+1) / 
                           (mean(x[all.grp], ...)+1), 2),
            p.value = t.test.p.value(x[gtf.grp], x[all.grp], 
                            alternative="two.sided",
                            paired = FALSE) )
  }
 
# add stats to raw data
data.untr <- cbind( data.untr, t(apply(data.untr, 1, foo, na.rm = TRUE)) )
 
# add adj.pval
data.untr <- cbind( data.untr, padj = p.adjust(data.untr$p.value, method="BH") )
 
# final merged table
head(data.untr, 3)
 
# write data to file
write.csv(data.untr, file = "htseq_counts_untreated-all_gtf-counts.csv")
 
## To conclude about the difference between full genome mapping (all) and exon-mapping (gtf), we use the mean signal (log2-transformed) obtained for each group of 4 cell untreated cell lines in a scatter plot.
 
# significant subsets
cutoff1 <- 0.1
sigres.01 <- subset(data.untr, data.untr$padj < cutoff1)
 
# significant genes with T-test < cutoff1
table(data.untr$padj < cutoff1)
 
cutoff2 <- 0.05
sigres.005 <- subset(data.untr, data.untr$padj < cutoff2)
 
# significant genes with T-test < cutoff2
table(data.untr$padj < cutoff2)
 
title <- "Counts significantly different\nbetween 'untreated'-groups ('all' vs 'gtf')"
plot( log(data.untr$gtfmean, 2), log(data.untr$allmean, 2),
      xlim=c(0,12),
      ylim=c(0,12),
      xlab="log2-mean gtf-counts",
      ylab="log2-mean all-counts",
      main=title, 
      col="green1", cex=0.25, pch=20)
 
points( log(sigres.01$gtfmean, 2), log(sigres.01$allmean, 2),
      col="red1", cex=0.5, pch=20)
 
points( log(sigres.005$gtfmean, 2), log(sigres.005$allmean, 2),
        col="black", cex=1, pch=20)
 
# add lines
abline(0, 1, col="blue1", lty=1, lwd=2)
sample alignment.not.unique ambiguous no.feature tot.counts
SRR1039508_all 6029848 609790 2048439 26792115
SRR1039512_all 7174795 711348 2564408 33081391
SRR1039516_all 7349997 688966 2530061 32316085
SRR1039520_all 5019316 554414 1939908 24653719
SRR1039508_gtf 2841152 743579 1330707 24488909
SRR1039512_gtf 3427386 859616 1741674 30345849
SRR1039516_gtf 3596078 855296 1744829 29551825
SRR1039520_gtf 2387593 656458 1347731 22766817

Investigating the read coverage across these genes would probably reveal mapping biases influenced by the non-exomic part of the human reference in the case of the 'all' mapping. The count of 'not.unique' mappings is higher in the 'all' HTSeq data while more 'ambiguous' mappings are reported in the 'gtf' HTSeq data.

The scatter-plot of gene count means reveals biases between the two workflows and a random distribution of pseudo-significant counts in the whole expression range.


compare_all-gtf_HTSeq-counts.png

Legend:

  • green dots represent mean values of genes-counts not significantly different between the tophat 'all' and 'gtf' sample groups
  • red dots had T-test results below 0.1
  • black dots obtained a T-test value lower than 0.05

We see in this final graph that a number of genes appear differentially expressed when comparing HTSeq counts obtained from the two tophat2 workflows applied to the same original reads.  

Closing Remarks

  • Most significant calls from one caller remain significant with a second caller, which is quire reassuring and fortunate.
  • DESeq2 leads to different results as compared to DESeq1 or EdgeR, especially for extreme calls
  • When EdgeR is used as significance filter, a number of black dots appear that are not confirmed by DESeq (EdgeR is known to be less stringent and call false-positive DE genes).
  • The choice of the Tophat reference (full genome or exome) has some effect on the counts in a given number of genes and may bias the differential expression analysis.

 

download exercise files

Download exercise files here.

Use the right application to open the files present in ex8-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.7 ]