OTU creation using LotuS
Exercise created by Falk Hildebrand
In this tutorial we will create a genus abundance table from two 454 sequencer runs using a pipeline called LotuS. A genus abundance table contains counts of the different genera in a set of samples: rows are the genera and columns the samples. As a simple example, take a look at this table:
Genus               bl10  bl11  bl12  bl128  bl13
Bacteria              24    52    39     63   181
Bacteroides          169    27     7     42     6
Porphyromonadaceae   370   346   621    565   224
This table tells us how often we observed unclassified Bacteria, Bacteroides and unclassified Porphyromonadaceae in the five samples bl10, bl11, bl12, bl128 and bl13. In this tutorial we will create such a matrix from the raw sequence data; it will then be used in the next tutorial on numerical ecology.
- 1 The data
- 2 The tools
- 3 The analysis
- 3.1 Creation of Mapping file
- 3.2 Changing the data format of the input files
- 3.3 Setting up a quality filter of the input sequence files
- 3.4 Demultiplexing and quality filtering the input files
- 3.5 Mapping file creation when sequence provider provides demultiplexed files
- 3.6 Running Lotus
- 3.7 Using a different taxonomy assignment on a finished run
- 3.8 Using a custom database
- 3.9 Get everything assigned!
- 3.10 Try a different sequence clustering algorithm
- 3.11 Your own best run
- 3.12 Using Illumina data as input
In a recent experiment, 73 samples were sequenced in two 454 runs. The raw fasta and quality files are in /opt/metagenomics-data/ on the bits server; for each run we have a fasta (.fna) and a quality (.qual) file. First we will copy the data to your home folder. Go to your home folder using the command cd:
Create a folder called metagenomics to store the data in:
Copy the files to this folder
cp /opt/metagenomics-data/*.* metagenomics/
Go in the metagenomics folder and uncompress the LotuStutorial.tar.gz file
tar -xvzf LotuStutorial.tar.gz
Take a look at the files that were provided by the experimenter (command ls). You can always inspect files and their contents using viewing commands like less.
The samples were multiplexed before sequencing, that is, a small nucleotide sequence – the barcode, specific for each sample – was attached to each read. A mapping file, containing the link between a sequence barcode and the name of the sample, is therefore essential to demultiplex the fasta files.
LotuS is actually a set of tools, installed in the /opt/ folder. First go to the LotuS website and familiarize yourself with the basic documentation.
To start the exercises, go to the directory where Lotus is installed.
From this directory you can run all the tools. To reach all data files (e.g. input files) you have to provide the path to the files: ~/metagenomics/
Creation of Mapping file
An Excel file is provided with some basic experiment annotation. The fwd primer is given as “ACTYAAAKGAATTGACGG”, but if you search for this primer sequence in the reads (in one of the .fna files) you will not find it, because you first need to reverse complement the primer sequence (e.g. with an online reverse-complement tool). So you see: annotation provided by the sequence provider is not always correct.
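Reverse complementing a primer that contains IUPAC ambiguity codes (Y = C/T, K = G/T) can be sketched in a few lines of Python; this is an illustration only, not part of the LotuS pipeline:

```python
# Minimal IUPAC-aware reverse complement (illustration only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "Y": "R", "R": "Y", "K": "M", "M": "K",
              "S": "S", "W": "W", "N": "N"}

def revcomp(seq):
    """Return the reverse complement, honouring IUPAC ambiguity codes."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(revcomp("ACTYAAAKGAATTGACGG"))  # -> CCGTCAATTCMTTTRAGT
```

Searching the reads for this reverse-complemented pattern (with Y/R/K/M expanded) is what finally locates the primer.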
LotuS needs experiment annotation to map input files to barcodes. Based on the documentation on the LotuS website, create a mapping file for this experiment. This means that you need to replace the column headers of the Excel file by terms that are accepted by LotuS, and that you have to indicate that there is a .fna and a .qual file for each run. The header line should be preceded by a #. The mapping file should at least contain four columns with the following headers: #SampleID, BarcodeSequence, fnaFile and qualFile.
Save the file as a tab-delimited text file.
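A minimal mapping file could look like the fragment below (tab-delimited; the sample IDs and barcodes shown here are made up for illustration – use the ones from the provided Excel file, and check the column names against the LotuS documentation):

```
#SampleID	BarcodeSequence	fnaFile	qualFile
bl10	ACGAGTGCGT	Anh.1.fna	Anh.1.qual
bl11	ACGCTCGACA	Anh.1.fna	Anh.1.qual
```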
You can always test the validity of your mapping file with the command
cd /opt/lotus-1.62/lotus_pipeline/
./lotus.pl -check_map [path to your mapping file]
If you have fastq files as input the fnaFile and qualFile columns would be replaced by one fastqFile column.
Changing the data format of the input files
Sometimes you need to convert data from one format to another. For instance, we can transform the fasta files (.fna) into fastq files (.fq). This can be done with the program sdm, which is part of the LotuS pipeline. Take a look at the sdm help by typing ./sdm and exploring the options.
Then, using command line arguments, convert the fasta + qual files of the Anh experiment into fastq files. Specify:
- Anh.1.fna as input fasta file
- Anh.1.qual as input quality file
- t1.fq in ~/metagenomics/ as output fastq file
Run the command from the metagenomics folder; this means you have to provide the path to the tool.
Transform fasta + qual files into fastq files ?
/opt/lotus-1.62/lotus_pipeline/sdm -i_fna Anh.1.fna -i_qual Anh.1.qual -o_fastq t1.fq
Take a look at the output in the metagenomics folder and the log files created by sdm in the lotus_pipeline folder:
Do the same for the t1.log file: you will see that sdm is not only used to transform fasta into fastq files, it is also capable of quality filtering the raw read files.
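Conceptually, what sdm does in this conversion is merge each sequence with its quality scores and encode the scores as ASCII characters (Phred+33). A minimal Python sketch of that idea (not the actual sdm code):

```python
def qual_to_ascii(quals, offset=33):
    """Encode a list of Phred quality scores as a FASTQ quality string (Phred+33)."""
    return "".join(chr(q + offset) for q in quals)

def fastq_record(read_id, seq, quals):
    """Build one FASTQ record from a fasta sequence and its quality scores."""
    return "@{0}\n{1}\n+\n{2}\n".format(read_id, seq, qual_to_ascii(quals))

# A made-up read, for illustration only.
print(fastq_record("read1", "ACGT", [40, 40, 30, 20]), end="")
```

Each FASTQ record thus holds the information of one fasta entry plus its matching qual entry in four lines.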
Setting up a quality filter of the input sequence files
To make sure the quality filtering of the input files is strict enough, LotuS offers several quality filtering options. Quality settings differ between data formats, which is why LotuS provides a file with platform-specific settings. Since we have 454 data we will use the file sdm_454.txt in the lotus_pipeline folder. First copy the file from this folder to ~/metagenomics/ and then take a look at it:
cp /opt/lotus-1.62/lotus_pipeline/sdm_454.txt ~/metagenomics/
cd ~/metagenomics/
less sdm_454.txt
Read the comments (lines starting with “#”) for each option and think about which quality filtering options might be important in order to create OTUs from the raw sequences. (Hint: an OTU is a cluster of similar sequences, with the aim of having one cluster of sequences for each species that was originally present in the samples. Take into account that sequencing machines make errors and that PCR amplification of the 16S rDNA likewise introduces errors.) Think about a set of parameters, including the statistical information from step 2, and save these in your copy of sdm_454.txt for later use.
Check the sdm quality filter settings. Some of the default filter settings are:
- MinSeqLength=250 : Only use reads of at least 250 nt long after processing (remember we are working with long reads from 454 sequencing)
- TruncateSequenceLength = 250 : Cut all reads after 250 nt
- QualWindowWidth = 50 and QualWindowThreshold = 25 : Remove all reads where average quality is <= 25 over a 50bp window
- maxAccumulatedError = 0.5 : Remove all remaining bases once the accumulated error score >= 0.5
- maxHomonucleotide = 8 : Remove all reads with a homonucleotide run (repeat of same nt) >= 8
- RejectSeqWithoutFwdPrim = TRUE : Remove all reads that do not contain the forward primer
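Two of these filters are easy to express in code. The windowed quality filter rejects a read if any 50 bp window has mean quality <= 25; the accumulated error filter truncates a read once the summed per-base error probabilities (10^(-Q/10)) reach 0.5. A simplified Python sketch of both ideas (not the sdm implementation):

```python
def passes_window_filter(quals, width=50, threshold=25):
    """Reject the read if any window of `width` bases has mean quality <= threshold."""
    for start in range(0, max(1, len(quals) - width + 1)):
        window = quals[start:start + width]
        if sum(window) / float(len(window)) <= threshold:
            return False
    return True

def truncate_at_error(quals, max_err=0.5):
    """Return the read length at which the accumulated expected error reaches max_err."""
    acc = 0.0
    for i, q in enumerate(quals):
        acc += 10.0 ** (-q / 10.0)   # Phred Q -> error probability
        if acc >= max_err:
            return i  # keep only the bases before this position
    return len(quals)
```

For example, a read of 100 bases at Q40 accumulates only 0.01 expected errors and is kept whole, while a stretch of low-quality bases quickly pushes the sum over 0.5.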
Demultiplexing and quality filtering the input files
For this step you will need the mapping file from Step 1 and the file with the quality filtering settings for 454 data. Use the sdm command to demultiplex and filter all input files at the same time into a local folder demultDir. First create the folder where the demultiplexed files will be written in ~/metagenomics/:
Demultiplex and quality filter files using the following parameters:
- Since the mapping file contains information on all files, you have to provide the input path to the folder that contains the input files (.fna + .qual)
- t1.fq is the output fastq file
- demultDir is the output folder for the demultiplexed files
- the mapping file is in the ~/metagenomics folder
- the options are in the sdm_454.txt file
Run this command from the metagenomics folder; this means you have to provide the full path to the tool.
Demultiplex and quality filter files ?
/opt/lotus-1.62/lotus_pipeline/sdm -i_path ./ -o_fastq t1.fq -o_demultiplex demultDir/ -map map.txt -options sdm_454.txt
The demultiplexing will fill the demultDir folder with fastq files.
Discuss the output files and what each of them represents. In this experiment multiple samples were sequenced in the same lane. Two lanes were used, each containing 37 samples. After sequencing, this results in two files with reads. To know which sample a read comes from, a unique barcode was incorporated into the adapter sequence of each sample. In this step, reads from different samples (but from the same lane, and thus in the same fasta file) are split over separate fastq files, one for each sample.
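The core idea of demultiplexing can be sketched as follows: look up each read's barcode prefix in the mapping, strip it off, and assign the read to the matching sample. This simplified Python sketch uses made-up barcodes; sdm additionally handles primers, barcode mismatches, quality filtering, etc.:

```python
def demultiplex(reads, barcode_to_sample):
    """Split reads over samples based on their barcode prefix.

    `reads` is a list of (read_id, sequence) tuples; the returned dict maps
    each sample name to its reads with the barcode stripped off.
    """
    per_sample = {sample: [] for sample in barcode_to_sample.values()}
    for read_id, seq in reads:
        for barcode, sample in barcode_to_sample.items():
            if seq.startswith(barcode):
                per_sample[sample].append((read_id, seq[len(barcode):]))
                break  # reads with no matching barcode are silently dropped
    return per_sample

# Hypothetical barcodes and reads, for illustration only.
mapping = {"ACGT": "bl10", "TGCA": "bl11"}
reads = [("r1", "ACGTGGGAAA"), ("r2", "TGCACCCTTT")]
print(demultiplex(reads, mapping))
```

In the real pipeline each per-sample read list would be written to its own fastq file in demultDir.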
Mapping file creation when sequence provider provides demultiplexed files
Now that you have demultiplexed the files into a single folder, note that sequence providers often deliver files in exactly this format: already demultiplexed into single files per sample. In this case slight modifications to the mapping file are enough to change the input from one or a few large non-demultiplexed files to many small demultiplexed files.
Note that LotuS has a special script, autoMap.pl, that creates the mapping file for you in this case: it links SampleIDs to the demultiplexed files. Run autoMap.pl from within the /opt/lotus-1.62/lotus_pipeline/ folder.
cd /opt/lotus-1.62/lotus_pipeline/
./autoMap.pl ~/metagenomics/demultDir/ ~/metagenomics/automap.txt 1,1
We will run Lotus on the demultiplexed files. Use:
- the mapping file you generated in Step 5 for the -m parameter
- the output folder is ~/metagenomics/testR/
- the input files are in the demultDir folder
Run lotus ?
./lotus.pl -i ~/metagenomics/demultDir/ -o ~/metagenomics/testR/ -m ~/metagenomics/automap.txt
In case you haven't done any quality filtering yet, you can still do it now. The command would then be extended with -s ~/metagenomics/sdm_454.txt
- Peek at the file hiera_RDP (using head). The file maps each OTU to a genus.
- Peek at the file OTU.txt (using head). The first line contains the number of reads that represent OTU_1 in each sample.
- Peek at the file otus.fa (using head). It contains the reads representing each OTU. You can use this file to blast the sequences to check if they are really from the OTU they were assigned to.
- Go to the folder higherLvl. This contains the data that we are going to use in the Ecology analysis.
- Go to the folder LotuSLogS. This contains the settings of the analysis. For instance, peek at the file demulti.log: it shows how many reads were rejected. The file citations.txt contains the references for reporting your LotuS results.
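As a quick sanity check, a tab-delimited OTU table like OTU.txt is easy to parse yourself. A minimal Python sketch, using made-up file contents (the real file's header and values will differ):

```python
def parse_otu_table(lines):
    """Parse a tab-delimited OTU table into {otu_id: {sample: count}}."""
    header = lines[0].rstrip("\n").split("\t")
    samples = header[1:]  # first column holds the OTU identifier
    table = {}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        table[fields[0]] = dict(zip(samples, map(int, fields[1:])))
    return table

# Hypothetical file contents, for illustration only.
lines = ["OTU\tbl10\tbl11\n", "OTU_1\t24\t52\n", "OTU_2\t169\t27\n"]
table = parse_otu_table(lines)
print(table["OTU_1"]["bl11"])  # -> 52
```

The genus-level tables in higherLvl have the same layout, with genera instead of OTU identifiers in the first column.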
Using a different taxonomy assignment on a finished run
In this step we want to reassign the taxonomy to a LotuS run, but keep exactly the same OTUs. In this exercise, assign the OTUs to the Silva taxonomy.
To try this, we will copy our current output folder (so we keep the original) and reassign the taxonomy on the copy (testR2). This option is useful if, for example, you want to keep your work on a given OTU set (as well as the phylogenetic tree), but want to see what would have happened if you had used e.g. Silva as reference database instead of utax. To do this add:
- SLV as reference database
- set -redoTaxOnly to 1
Run this command as superuser
Reassign taxonomy ?
cp -r ~/metagenomics/testR/ ~/metagenomics/testR2/
./lotus.pl -i ~/metagenomics/demultDir/ -o ~/metagenomics/testR2/ -m ~/metagenomics/automap.txt -s ~/metagenomics/sdm_454.txt -refDB SLV -redoTaxOnly 1
Using a custom database
Researchers of honey bee gut communities use very specific taxonomic names for already known bacteria. In order to accommodate their naming scheme, we will use a very specific database that contains 16S sequences of bacteria mostly found in the honey bee gut. Download the mapping file, the bee taxonomy in tax format and the bee taxonomy in fna format.
Use the two provided files (fna, tax) to redo the taxonomy once more, but this time assign first against the honey bee DB; everything with only a low-scoring hit should then be assigned with the SILVA database.
Use honey bee taxonomy database ?
./lotus.pl -i XX -o ~/metagenomics/testR3/ -redoTaxOnly 1 -m ~/metagenomics/LASI_Spring_2_bees_barn_3_map_lts_5.txt -refDB ~/metagenomics/beeTax.fna,SLV -tax4refDB ~/metagenomics/beeTax.tax
Get everything assigned!
In this step we want to assign every OTU sequence to a database target – and we don’t care about false positive assignments! Of course this is per se wrong, but in some cases you just want to know what the best hit would be, even if it is only 90% similar to your OTU sequence. LotuS provides several options that allow tweaking towards more lenient assignments. Find all options related to this and try to create the most extreme case with these options, by reassigning the taxonomy again as in the previous step.
Try a different sequence clustering algorithm
Now rerun LotuS, but try to optimize for many small, tightly defined OTUs (which might correspond to something like strain level). Which clustering algorithm might be suitable? Which clustering cutoffs make sense? For this specific run, use the first mapping file you created (step 1) and the non-demultiplexed input files. Save this output in the folder testR4.
Your own best run
Now that you have run LotuS with various databases and options, go back and look at the output folders of the different runs, and at the statistics provided in the LotuSLogS subfolder. Based on this, tune the sdm filtering parameter file from step 3 (again), choose the database you think best suited or most interesting, and choose a clustering algorithm. With these, run the sample set again, saving the output in folder testrun1.3. You can use this output folder in the following session on numerical ecology.
When the LotuS run has finished, go to the specified output folder and copy genus.txt from the higherLvl subfolder to your home folder:
cp testrun1.3/higherLvl/genus.txt ~
Using Illumina data as input
In all the analyses before we used 2 x 454 runs, an outdated next-generation sequencing technology. For the next exercise we will look at the output of an Illumina MiSeq sequencing platform, which is still widely used today.
Set up the mapping file, using the provided Miseq.xlsx file. Run LotuS, after you set up a custom sdm configuration file and using a combination of parameters you learned about in previous steps.
This run might take somewhat longer to finish. Be sure to set it to use all the cores of your computer and let it run over the lunch break.
Congratulations, now you know how to process raw sequence files into meaningful summary tables that can be directly analyzed in R or even Excel! In the next tutorial, after the lunch break, this matrix will be analyzed with the help of R.