Ecology analysis using vegan

From BITS wiki
Jump to: navigation, search

Exercise created by Falk Hildebrand

[ Main_Page | NGS_data_analysis ]

In this exercise we will look at a data matrix of 16S rRNA counts in 74 samples.

This dataset is the microbiota composition of 74 mice from 5 different mice strains. The original research aim was to define the effect that the mouse genome has on the microbiota and what the effect of living in the same cage would be. However, we found much stronger trends in the data, and these we will look at in this exercise.

The 454 data was already compiled into a matrix with genus abundance per sample in a previous step. This matrix is called a feature abundance matrix, or abundance matrix for short. We will do an ecology-oriented analysis of the data, in later steps also taking metadata (experimental, environmental or clinical data that was collected for each sample, independent of the DNA) into account. The aim of this tutorial is to get an idea of the very basic steps of ecological data analysis using the programming language R.

The gene abundance table (Genus.txt) is already in the ~/metagenomics/testR/higherLvl/ folder or can be downloaded from the lotus website.

Set working directory

Set the folder with the provided files as your working directory in R using setwd. This way required files can be easily loaded. To find out how to use this command, you can type

?setwd()

to open the help. If there are other R-commands that you want to know more about, you can open the R-help for that command by running ?command. This will be very useful when working with R, make sure to use this a lot as you can only learn more :o).

Load the data into a matrix

Load the provided data into the matrix M (Genus.txt, actual genus abundance data), using the read.delim command, saving the loaded table as M. Make sure, the row names are correctly read in.

As R reads the matrix as an object of class data.frame, we convert M from a data.frame to a matrix M=as.matrix(M). This is important for some of the following calculations, where we need a matrix class object.

The matrix you loaded represents the number of 16S sequences assignable to each genus, which we could find in the samples. Also note that not all genera are real genera, but partly assigned unknown sequences. With these groups we do not know if this is a single genus or in fact several genera or in extreme cases even several classes, that just all fall under the same phylum tag. What are the advantages and disadvantages of keeping such undefined groups in the data?

Use the function edit(M) to better view the abundance matrix.

Take a look at the data

Let’s look at some basic features of the abundance matrix. The summary(M) command is a good start, but also look at total row and column counts (colSums, rowSums command). To see how the genera are distributed within each sample, we will plot a sample-wise density plot.We will be using a combination of the density, lines and lapply functions, to draw the densities of values found in each sample. Let’s start with looking at the density of the first sample. In R you can access specific columns by writing the matrix coordinates in square brackets. For example M[1,] shows the first row of a matrix, M[,7] shows the 7th column etc:

Look at the object densityOfSample1 by simply entering the object name into the command prompt. Next try to visualize it with plot(densityOfSample1). In this plot you see that most genera are at 0 abundance, some genera have an abundance <10 and some rare genera actually occur with a higher frequency, one genus even having ~1100 16S reads assigned to it. Which genus is this?

Alternatively you can also use the function hist, to plot a histogram of the abundances. Try to do this now.

We can use the apply command, to apply the density command to every column of M, which will return a list of density objects. The second argument to the apply function is the margin and is set to 2, which tells the apply function that we want to work on columns (margin = 2) and not on rows (margin = 1). Save this into object S_densities.

To plot this start with:

# open a new plot window and set range of x and y axis
plot(1,1,type="n",ylim=c(0,3),xlim=c(0,5000))

This will open a new plotting window, already set to the range of x and y coordinates (xlim, ylim) we will need in this example. In this case we just want to plot a blank space, this is done with the type=n argument. Try to replace the argument by type=p, to actually see that point! S_densities is a list, so we use lapply (list apply), in combination with the lines function, try this now to plot all the density lines into the open plot window.

What you should see now in the plot window is the density distribution of all samples. The lines function is adding new lines, while a plot function makes a completely new plot. Try to replace the lines with plot to see this (it’s very fast, so keep a close eye on your plot). How are these lines already telling us something about the differences between the communities of each sample?

Normalize data

Maybe you noticed that the colSums command showed that the totals are not equal. What does this mean? In this state the data is actually not comparable among each other. One way to correct the data for this shortcoming is to normalize the matrix. In this step we will normalize the abundance matrix into variable M1:

# normalize matrix: divide each column by the total of that column
M1 = sweep(M,2,colSums(M),"/")

The sweep command is extremely useful, as it will apply a simple arithmetic operation (like divide) in a matrix column- or row-wise with a vector of your choice. So it is very similar to apply, but takes more basic functions. In this case we will divide each column by the sum of the column, this is called normalization.
Now we will compare these matrices using the barplot function. For this we need to open another graphical window, using the X11 function:

# create barplot of original and normalized data
barplot(M)
X11()
barplot(M1)

What do you notice about the sample composition? What does the graph mean? Discuss where you would want to normalize the data (and where not).

Close all open plots.

Now replot the sample-wise density plot (as you did in step 3), but start the plot with these adapted x and y ranges. Additionally we will this time label the x- and y-axis:

# open a new plot and define ranges and titles of x and y axis
plot(1,1,type="n",ylim=c(0,80),xlim=c(0,1),xlab="relative genus abundance", ylab="Frequency of genera")
You will notice that the graph looks different from you previous plot. What changed due to the normalization? Are the samples more similar to each other using M or M1?

If you spot a difference in species abundance between two samples using matrix M, is this difference real, does it have scientific value?

Calculate the diversity

For the next step the R-library vegan is required. It is a set of functions specifically designed for ecological data analysis. The package has been installed on the server. If you were to install the package, you could do so using the command: install.packages(“vegan”). More details on the vegan web site. Load vegan, using the library command.

Let’s try to put the differences we observed in sample density into numbers. To do this, ecologists rely on the concept of diversity. Diversity describes the evenness of species distributions as well as the richness of species that are observed in a given ecological system. We will first calculate the Shannon diversity, using vegan’s diversity command. Try to do this per sample, using the apply function again. Save the result in object div.

Now we can see in action what these indices are actually doing for us. Plot the density of the sample with the lowest and highest diversity in red and blue on your previous density plot of M1, this you do by first finding out which diversity indexes are the maximum and minimum values using the which.max and which.min functions on the object div. Don’t forget to have the last density plot still open (or replot it from step 4 on M1), than add the lowest samples as a blue line and the highest sample as a red line, using the lines command.

You can now readjust the window by changing the ylim and xlim attribute in the plot function, if necessary (tip, try to rerun using ylim=c(0,180)). Try to explain why the colored samples have the highest & lowest diversity. What does this tell about an ecosystem (remember that these are genus abundances).

Raise your hand if you reached this step.

Alternative normalization methods

A different way to normalize the data is to sample exactly equal amounts of 16S rDNA for each sample in this experiment. Of course in practice this is impossible to do, but we can simulate this, by randomly selecting a subset of 16S rDNA. This is called rarefaction. Rarefy your original abundance matrix (M) into M2, using 1000 reads per sample, using the rrarefy function of vegan. Note that you need to transpose (command t()) the matrix, before giving it to rrarefy. Transform the matrix back and save it as M2. An alternative to vegan's rrarefy is the package rtk (rarefaction toolkit) that we develop and that is considerable faster for larger datasets. You can try to install rtk[1] (install.packages("rtk");) and use the function rtk instead of rrarefy.

Use colSums(M2) to check if the rarefaction worked. The main use of rarefaction is in calculating diversity and richness correctly, for this we will look in the following step at observed richness.

The concept of observed richness within a sample is pretty simple (but useful): richness describes the number of different species that occur at least once in a sample. This can be calculated in two steps:

# Species present in sample: TRUE or 1 if species is present, FALSE or 0 if species is absent
OnceOrMoreOftenPresent = M1>0

The sum of each column in this matrix will tell us how many species were detected in total within the respective sample, use the apply and sum functions , saving the result in rich1.

Compare the richness values in rich1 to the richness obtained on the rarefied matrix M2, calculated with a shortened command:

# Calculate number of present species in each sample using the rarefied data
rich2 = apply(M2>0,2,sum)

Compare rich1 and rich2 in a matrix value by value. We use the cbind command to bind two vectors column wise together, so we get a matrix with 2 columns. Order this matrix by the richness values in rich1, using the order command and accessing the vector representation with [] square brackets.

# Create new matrix with two columns: rich1 and rich2 and order rows according to rich1 values
cbind(rich1,rich2)[order(rich1),]

What does the second part of the formula do? What happens if you change that to order(rich2)?

Discuss which richness values have the highest value to the researcher and why the order is very different between these two richness estimates. Is one way clearly wrong?

Why did we choose 1000 as cutoff for the sequences per sample? What is the maximum value we could choose?

Cluster the samples

First samples are clustered to see underlying data structures. For this tutorial we will choose a hierarchical clustering, based on a bray-curtis distance between samples, using the function vegdist. Make sure the distances are calculated between Samples and not Genera.

Next, use the function hclust on the distance matrix, saving the output in variable cluster, and subsequently plot the clustering of the samples (using plot). Take a guess of how many groups there might be in this clustering?

To visualize the samples and their relatedness to each other in a two-dimensional space, we can use an ordination to visualize the data in a low dimensional space. The dimensionality of the original matrix (73 genera=73 dimensions) is reduced to two dimensions. If you know what a PCA (Principal component analysis) is, this step will use a conceptually similar, but methodologically quite different technique to perform an ordination of the data, NMDS (non-metric multidimensional scaling).

Start by calculating a 2-dimensional NMDS of the data using M1, using the Bray-Curtis distance in the function metaMDS, saving the result to nmds. Again, make sure that samples are being ordinated and not Genera.

Take a look at the nmds object and explore some of its features (e.g. type str(nmds) to see what variables are stored within the NMDS object). Try to find out what the stress of your ordination is. What does stress stand for (tip: go to the R help on metaMDS)? Next we can visualize the NMDS, similar to what you get out of PCA’s, displaying samples only:

# plot NMDS
plot(nmds,display ="sites")

The important difference of NMDS compared to PCA is, that NMDS works with any kind of distance metric, while PCA can only use Euclidean distances between samples. A second important feature of NMDS is, that this method finds non-parametric, monotonic relationships between objects; in short: it doesn’t assume a specific data distribution. Why might these two features be important for ecologists?

You might have noticed that you see two clusters, similar to the hierarchical clustering of the data. We can get for each sample the identity within the two clusters using the cutree commands, specifying k=2 (2 clusters). This can be plotted into the NMDS with the following command:

# identify clusters
memb = cutree(cluster, k = 2)
ordispider(nmds,memb)

Congratulations, you have just visualized the mouse enterotypes. Next we are going to look closer at these. If you want to know the exact methods to detect enterotypes in your data visit http://enterotype.embl.de/enterotypes.html

Find genera with significant differences in abundance

In the last step, we will test for all the genera in the matrix whether they show significant differences between two clusters. The scientific question we are posing here is: what are the significant differences in the gut microbiota of between enterotypes? We will use a non-parametric test (kruskal-wallis) to do the tests, as ecological data is in most cases not normally distributed. This test is very similar to the student t-test, and the interpretation works just the same way. Use the function kruskal.test to test the first genera (M[1,]) for significant differences between the two cluster groups (in object memb). Save the output of this command in variable Kt.

Look at the output of this function. This will show you a human readable summary of the test and the result. You can access elements of a list (Kt is a list in this case) using the $ operator. Try to extract the p-value from the Kt object.

Once you know how, we can start to calculate the significance for every genus in the M1 matrix,. These p-values we will store in a newly created vector pvals. Let’s add the first 2 p-values to the vector:

# Test if there is a difference between the two clusters for the first and second genera. Store p-values in a vector.
pvals = c()
pvals[1] = kruskal.test(M1[1,], memb)$p.value
pvals[2] = kruskal.test(M1[2,], memb)$p.value

Since doing this 73 times takes a long time, we will be using a for-loop to loop over the matrix and do this for us. We could as well use the apply function, but the syntax would get a little more complicated, since we are only interested in a subpart of the result, the $p.value part. Try to write a for-loop, to calculate the p-value 73 times.

As an additional help, you can add the name of the taxa to the pvals vector using the names command (that will name a vector):

# Add names to the vector
names(pvals) = dimnames(M1)[[1]]

Which taxa are significantly different?

In this case we will use the normalized M1 matrix, can you explain why we do not use the M or M2 matrix? Would either be wrong to use?

In total we were testing in 73 genera, if their p-value was below a threshold of 0.05. What is the chance of observing data with a p-value >0.05 by random chance? How many genera do you expect to be below this threshold by random chance?

To avoid statistical errors of this kind, we will use a Benjamini-Hochberg multiple testing correction, implemented in the R function p.adjust. Save the result as qvals.

What do you see in this test? What would you report on this dataset, based on these values?

Try sorting the q-values to see the most significant differences first:

# Sorting q-values
sort(qvals)

Now that you have finished the tutorials, you should be able to analyze any new dataset of amplicon data, using the LotuS pipeline and performing a basic analysis with R, including

  • Data normalization
  • Clustering analysis
  • Ordination
  • Univariate statistics

You can always expand upon these concepts, using this tutorial as starting point. Just remember that R is a very flexible language, and all these commands can be expanded for new purposes and visualizations.

Data sources

All the material provided in this tutorial are from metagenomic study on mice knockouts. Further analysis of the data can be found in: [1]


References:
  1. Hildebrand, F., Nguyen, A. T. L., Brinkman, B., Yunta, R. G., Cauwe, B., Vandenabeele, P., … Raes, J. (2013). Inflammation-associated enterotypes, host genotype, cage and inter-individual effects drive gut microbiota variation in common laboratory mice. Genome Biology, 14(1), R4. doi:10.1186/gb-2013-14-1-r4