NGS RNASeq DE Exercise.7

From BITS wiki
Jump to: navigation, search

Use the RobiNA GUI to find DE genes using EdgeR and DESeq1 packages

RobiNA.png

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.6 | NGS_RNASeq_DE_Exercise.8 ]


Robina run through example

RobiNA is a standalone GUI with embedded R code that allows differential analysis of microarray or NGS data. It was developed by a plant group but also applies to non-plant data although without the part adding gene symbols to the DE results. The final annotation can easily be taken care of using the attached R-code.

We use for this example a slightly modified version of the count table obtained from the bioconductor airway package. The table only includes the untreated and dex-treated cell line samples (N=3) and does not look at the effect of the second drug or at the drug combination. This table was generated by state of the art tophat mapping (tuxedo workflow) and should be very similar, if not identical, to the corresponding table produced with this session code. In order for RobiNA to identify the two groups 'correctly', the original header line was modified to comply with the software. The first column was named ID, the untreated cell line samples were renamed untr_1 to _3, and the Dex samples where renamed dex_1 to _3.

Technical.png The paired status of the samples is not taken into account by RobiNA which treats them as technical replicates. This is not correct but is all what you get using this simplified interface

ID               untr_1  dex_1  untr_2  dex_2  untr_3  dex_3  untr_4  dex_4
ENSG00000000003  679     448    873     408    1138    1047   770     572
ENSG00000000005  0       0      0       0      0       0      0       0
ENSG00000000419  467     515    621     365    587     799    417     508
ENSG00000000457  260     211    263     164    245     331    233     229
ENSG00000000460  60      55     40      35     78      63     76      60
ENSG00000000938  0       0      2       0      1       0      0       0
ENSG00000000971  3251    3679   6177    4252   6721    11027  5176    7995
ENSG00000001036  1433    1062   1733    881    1424    1439   1359    1109
ENSG00000001084  519     380    595     493    820     714    696     704

RobiNA common setup steps

rc_01.png


rc_02.png


rc_03.png


rc_04.png


rc_05.png


rc_06.png


rc_07.png

RobiNA DESeq1 steps

RobiNA EdgeR steps

Add gene symbol information to the tables and compare results

The following code was used for other similar tables and can be recycled with minor changes to add a column of gene symbols to the RobiNA tables; It makes use of R packages biomaRt that query the Biomart database and can add many more annotation types than what is shown here. Please read the biomaRt package manual for more information.

RobiNA-add_symbol.R script

#!/usr/bin/RScript
# RobiNA-add_symbol.R
 
# load specific library
library("org.Hs.eg.db")
 
# preview available types
columns(org.Hs.eg.db)
 
#  [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"      "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"      
# [11] "MAP"          "PATH"         "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
# [21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"     
 
# define function to add names to the existing table
convertIDs <- function( ids, fromKey, toKey, db, ifMultiple=c( "putNA", "useFirst" ) ) {
  stopifnot( inherits( db, "AnnotationDb" ) )
  ifMultiple <- match.arg( ifMultiple )
  suppressWarnings( selRes <- AnnotationDbi::select(
    db, keys=ids, keytype=fromKey, cols=c(fromKey,toKey) ) )
  if( ifMultiple == "putNA" ) {
    duplicatedIds <- selRes[ duplicated( selRes[,1] ), 1 ]
    selRes <- selRes[ ! selRes[,1] %in% duplicatedIds, ] }
  return( selRes[ match( ids, selRes[,1] ), 2 ] )
}
 
# read RobiNA data in. edit to match your own path
basedir <- "/work/TUTORIALS/NGS_RNASeqDE-training2015/RobiNA/"
setwd(basedir)
 
#### EdgeR results
 
# EdgeR: Identifier  logFC  logCPM	PValue	FDR
datafile1 <- "RobiNA_EdgeR/detailed_results/full_table_dex-untr.txt"
 
# let res1 be a result table returned by RobiNA and containing a first column named ID with ENSG-IDs
res1 <- read.table(datafile1, header=T, row.names=1)
 
# fetch matching gene symbols
res1$symbol <- convertIDs( row.names(res1), "ENSEMBL", "SYMBOL", org.Hs.eg.db )
 
# reorder columns and bring symbols in column#1
# reorder rows by increasing p.value
res1 <- res1[,c(5,1:4)]
res1 <- res1[with(res1, order(PValue)), ]
 
# preview final table
head(res1)
 
# save table to disk
filename1 <- "EdgeR_results-all.csv"
write.csv( as.data.frame(res1), file=filename1 )
 
# count genes for FDR<0.05 and abs(logFC>=1)
table(res1$FDR<0.05 , abs(res1$logFC)>=1)
# FALSE  TRUE
# FALSE 53679  9065
# TRUE    210  1148
 
### DESeq1 results
# Deseq1: id	baseMean	baseMeanA	baseMeanB	foldChange	log2FoldChange	pval	padj
 
datafile2 <- "RobiNA_Deseq1/detailed_results/full_table_dex-untr.txt"
 
# let res1 be a result table returned by RobiNA and containing a first column named ID with ENSG-IDs
res2 <- read.table(datafile2, header=T, row.names=1)
 
# fetch matching gene symbols
res2$symbol <- convertIDs( row.names(res2), "ENSEMBL", "SYMBOL", org.Hs.eg.db )
 
# reorder columns and bring symbols in column#1
# reorder rows by increasing p.value
res2 <- res2[,c(8,1:7)]
res2 <- res2[with(res2, order(pval)), ]
 
# preview final table
head(res2)
 
# save table to disk
filename2 <- "DESeq1_results-all.csv"
write.csv( as.data.frame(res2), file=filename2 )
 
table(res2$padj<0.05 , abs(res2$log2FoldChange)>=1)
# FALSE  TRUE
# FALSE 22442 10220
# TRUE    252   555
> head(res1, 13) # EdgeR
                 symbol     logFC   logCPM       PValue          FDR
ENSG00000109906  ZBTB16  7.129259 4.156839 7.525572e-77 4.824042e-72
ENSG00000171819 ANGPTL7  5.570803 3.551126 7.983188e-59 2.558692e-54
ENSG00000127954  STEAP4  5.021176 3.716265 3.068593e-54 6.556765e-50
ENSG00000179593 ALOX15B 11.057712 1.621229 2.118021e-52 3.394234e-48
**ENSG00000152583 SPARCL1  4.540837 5.533920 3.239141e-50 4.152709e-46
ENSG00000163884   KLF15  4.453668 4.700292 2.538302e-48 2.711838e-44
ENSG00000168309 FAM107A  4.356499 2.858106 3.240754e-43 2.967697e-39
ENSG00000250978    <NA>  6.160280 1.353697 1.323731e-41 1.060672e-37
**ENSG00000096060   FKBP5  3.822147 6.897970 6.342675e-41 4.517535e-37
ENSG00000101347  SAMHD1  3.748595 9.207328 8.763185e-40 5.617377e-36
ENSG00000211445    GPX3  3.569312 9.159303 6.578585e-38 3.833641e-34
ENSG00000100033   PRODH  4.742678 1.313615 1.919896e-34 1.025577e-30
**ENSG00000162692   VCAM1 -3.564117 4.580852 2.478399e-34 1.222080e-30

> head(res2, 13) # DESeq1
                 symbol   baseMean   baseMeanA   baseMeanB  foldChange log2FoldChange         pval         padj
**ENSG00000096060   FKBP5 2564.36223  4815.07582  313.648638  0.06513888      -3.940337 1.218100e-45 4.076859e-41
ENSG00000189221    MAOA 2341.76725  4253.70269  429.831818  0.10104886      -3.306875 2.078128e-42 3.477643e-38
ENSG00000179094    <NA>  776.59667  1399.33687  153.856466  0.10994956      -3.185086 5.894244e-38 6.575815e-34
ENSG00000170214  ADRA1B  201.57641   374.10413   29.048691  0.07764868      -3.686895 4.056765e-36 3.394397e-32
**ENSG00000152583 SPARCL1  997.43977  1916.07633   78.803214  0.04112739      -4.603757 3.806190e-34 2.547788e-30
**ENSG00000162692   VCAM1  508.17023    76.57865  939.761815 12.27185141       3.617281 1.100969e-28 6.141387e-25
ENSG00000178695  KCTD12 2649.85015   794.97080 4504.729494  5.66653451       2.502467 3.476343e-25 1.662139e-21
ENSG00000250978    <NA>   56.31819   111.15372    1.482666  0.01333888      -6.228219 4.646981e-25 1.944123e-21
ENSG00000120129   DUSP1 3409.02938  6036.62368  781.435072  0.12944903      -2.949544 3.915487e-24 1.456083e-20
ENSG00000125148    MT2A 3656.25278  6007.70878 1304.796789  0.21718709      -2.202990 2.542506e-22 8.509513e-19
ENSG00000124766    SOX4 1281.46866   417.41921 2145.518101  5.13996008       2.361757 1.265499e-21 3.850452e-18
ENSG00000166741    <NA> 7432.93635 12151.58166 2714.291047  0.22336936      -2.162497 4.500001e-21 1.255088e-17
ENSG00000198624  CCDC69 2057.19173  3626.73778  487.645681  0.13445849      -2.894767 1.094766e-20 2.818516e-17

Mortasecca.png Warning: Note SPARCL1, FKBP5, and VCAM1 with opposite sign for logFC between tables


plot recall Venn diagrams using VennDiagram or limma

### plotting and VENN diagram
 
# volcano plots
png("EdgeR-volcano.png")
plot(1-res1$FDR ~ res1$logFC,
     xlim=c(-10,10),
     xlab="log FC",
     ylab="1-FDR",
     main="EdgeR data",
     pch=20, 
     cex=0.5
)
abline(h=0.95, lwd=2, col="blue")
dev.off()
 
png("DESeq1-volcano.png")
plot(1-res2$padj ~ res2$log2FoldChange,
     xlim=c(-10,10),
     xlab="log FC",
     ylab="1-FDR",
     main="DESeq1 data",
     pch=20, 
     cex=0.5
)
abline(h=0.95, lwd=2, col="blue")
dev.off()
 
# scatterplot of logFC in two methods
big <- merge(res1, res2, by=0)
 
png("log-FC_concordance.png")
plot(big$log2FoldChange ~ big$logFC,
     xlim=c(-6,6),
     ylim=c(-6,6),
     xlab="EdgeR fold change (log)",
     ylab="DESeq1 fold change (log)",
     main="fold-change concordance",
     pch=21,
     cex=0.5)
abline(a=0,b=-1,col="green",lwd=2)   
dev.off()
 
png("FDR_concordance.png")
plot(big$padj ~ big$FDR,
     xlim=c(0,0.05),
     ylim=c(0,0.05),
     xlab="EdgeR FDR",
     ylab="DESeq1 padj",
     main="fold-change concordance",
     pch=21,
     cex=0.5)
abline(a=0,b=1,col="green",lwd=2)   
dev.off()
 
# subset ENS IDs for DE genes
de1.ENS <- row.names( subset(res1, (abs(res1$logFC)>=1) & (res1$FDR<=0.05) ) )
de2.ENS <- row.names( subset(res2, (abs(res2$log2FoldChange)>=1) & (res2$padj<=0.05) ) )
 
#### plot Venn diagram to show concordance
# require("VennDiagram")
venn.plot <- venn.diagram(
  list(edgeR.DE = de1.ENS, DESeq1.DE = de2.ENS),
  filename = NULL
)
png("Venn_DE-recall.png")
grid.draw(venn.plot)
dev.off()
 
# alt using limma
library("limma")
 
Lists <- list(de1.ENS, de2.ENS)  #put the word vectors into a list to supply lapply
items <- sort(unique(unlist(Lists)))   #put in alphabetical order
 
vennData <- matrix(rep(0, length(items)*length(Lists)), ncol=2)  #make a matrix of 0's
colnames(vennData) <- paste0("List", 1:2)
rownames(vennData) <- items
lapply(seq_along(Lists), function(i) {   
  #fill the matrix
  vennData[items %in% Lists[[i]], i] <<- table(Lists[[i]])
  })
 
head(vennData)   #look at the results
 
a <- vennCounts(vennData)
 
colnames(a) <- c('edgeR.DE', 'DESeq1.DE', 'Counts')
vennDiagram(a, main='Comparison of DE genes from two callers')
title(main='RobiNA results |logFC|>1 & FDR<0.05', cex.main=0.9)
 
png("Venn_DE-recall2.png")
dev.off()

The script ends by comparing genes for which at least 2-fold differential expression with a FDR of less than 0.05 was observed from the three cell lines. As seen below, EdgeR reports many extra DE genes but also most of the DE genes found by DESeq1. This is a known fact for these two packages

EdgeR-volcano.png
DESeq1-volcano.png
log-FC_concordance.png
(concordant genes in red)
FDR_concordance.png
(partial range [0; 0.1])
Venn_DE-recall.png
Venn_DE-recall2.png

Mortasecca.png Warning: DESeq1 takes the contrast in the opposite direction as compared to EdgeR 'Control vs treated' io 'treated vs Control'

Technical.png Also note that the returned statistics are quire different between methods although both methods overlap well for the genes found DE (concordant genes in red)


running DESeq1 in reverse order

> head(res3, 13) # DESeq1 results after inverting Dex and Untreated
                 symbol   baseMean   baseMeanA   baseMeanB foldChange log2FoldChange         pval         padj
ENSG00000096060   FKBP5 2564.36223  313.648638  4815.07582 15.3518148       3.940337 1.218100e-45 4.076859e-41
ENSG00000189221    MAOA 2341.76725  429.831818  4253.70269  9.8962024       3.306875 2.078128e-42 3.477643e-38
ENSG00000179094    <NA>  776.59667  153.856466  1399.33687  9.0950800       3.185086 5.894244e-38 6.575815e-34
ENSG00000170214  ADRA1B  201.57641   29.048691   374.10413 12.8785193       3.686895 4.056765e-36 3.394397e-32
ENSG00000152583 SPARCL1  997.43977   78.803214  1916.07633 24.3146979       4.603757 3.806190e-34 2.547788e-30
ENSG00000162692   VCAM1  508.17023  939.761815    76.57865  0.0814873      -3.617281 1.100969e-28 6.141387e-25
ENSG00000178695  KCTD12 2649.85015 4504.729494   794.97080  0.1764747      -2.502467 3.476343e-25 1.662139e-21
ENSG00000250978    <NA>   56.31819    1.482666   111.15372 74.9688177       6.228219 4.646981e-25 1.944123e-21
ENSG00000120129   DUSP1 3409.02938  781.435072  6036.62368  7.7250483       2.949544 3.915487e-24 1.456083e-20
ENSG00000125148    MT2A 3656.25278 1304.796789  6007.70878  4.6043252       2.202990 2.542506e-22 8.509513e-19
ENSG00000124766    SOX4 1281.46866 2145.518101   417.41921  0.1945540      -2.361757 1.265499e-21 3.850452e-18
ENSG00000166741    <NA> 7432.93635 2714.291047 12151.58166  4.4768897       2.162497 4.500001e-21 1.255088e-17
ENSG00000198624  CCDC69 2057.19173  487.645681  3626.73778  7.4372396       2.894767 1.094766e-20 2.818516e-17

Discussion

Although not always easy to install, RobiNA presents the unique advantage to be minimalist and to bring the user very quickly to results. RobiNA is able to analyze microarray data as well as NGS data and accepts precomputed gene counts for those who do not have enough computing power to perform the mapping steps on their machine.

Obviously, more advance statistical handling will be more accurate in many cases and RobiNA only allows pairwise DE analysis but still, it is a remarkable piece of software for the beginner.

download exercise files

Download exercise files here.

Use the right application to open the files present in ex7-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.6 | NGS_RNASeq_DE_Exercise.8 ]