Analyse and quantify Metagenomic sequencing data
Identify and quantify 'non-human / metagenomics' reads in NGS data
[ Main_Page ]
Contents
- 1 Introduction
- 2 Kraken-provided merged and synthetic datasets
- 3 Run Kraken with the provided 'accuracy' datasets
- 4 Run the RTG composition-meta-pipeline on simBA5
- 5 Results for the Kraken MiSeq and HiSeq compiled datasets
- 6 Kranen performance with real data
- 7 Download data files
Introduction
Non Reference reads present in NGS data may come from sample or library prep contamination but may also be biologically relevant in case of sample mixture resulting from a metagenomic study, symbiotic relationship between a host and its guests, or samples from infected species. A popular tool to identify such non-Reference reads is Kraken [1] which makes use of a sequence database built from subsets of the NCBI data. RealTime Genomics proposes a similar pipeline presented in this white paper and that can be run without costs for Academic use.
The aim of this white paper is to introduce both methods and present the results of analysing a defined test datasets (from Kraken paper).
Repeating this work will require downloading and installing the Kraken and RTG software as well as their dependencies and reference databases.
Kraken-provided merged and synthetic datasets
The Kraken paper[1] used public data from MiSeq and HiSeq as well as synthetic data to evaluate the performance of the software. We borrowed these datasets, and the quoted sentences below, to evaluate RTG on the same data. Please refer to the Material and Methods section of the Kraken paper for more details.
- MiSeq and HiSeq data: <<The HiSeq and MiSeq metagenomes were built using 20 sets of bacterial whole-genome shotgun reads. These reads were found either as part of the GAGE-B project or in the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes. For both the 10,000 and 10 million read samples of each of these metagenomes, 10% of their sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance). All sequences were trimmed to remove low quality bases and adapter sequences. >>[1]
The names associated with the strains of the 20 genomes are:
- Bacillus cereus VD118
- Citrobacter freundii 47N
- Enterobacter cloacae
- Klebsiella pneumoniae NES14
- Mycobacterium abscessus 6G-0125-R
- Proteus vulgaris 66N
- Rhodobacter sphaeroides 2.4.1
- Staphylococcus aureus ST22
- Salmonella enterica Montevideo str. N19965
- Vibrio cholerae CP1032(5)
HiSeq:
- Aeromonas hydrophila SSU
- Bacillus cereus VD118
- Bacteroides fragilis HMW615
- Mycobacterium abscessus 6G-0125-R
- Pelosinus fermentans A11
- Rhodobacter sphaeroides 2.4.1
- Staphylococcus aureus M0927
- Streptococcus pneumoniae TIGR4
- Vibrio cholerae CP1032(5)
- Xanthomonas axonopodis pv. Manihotis UA323
- simBA5 data: << This sample, featuring simulated bacterial and archaeal reads (called simBA-5), was created with an error rate five times higher than would be expected, to evaluate Kraken’s performance on data that contain many errors or have strong differences from Kraken’s genomic library >>[1]
Run Kraken with the provided 'accuracy' datasets
Requirements
- Kraken was installed[2] together with the minimal database minikraken[3], the standard database was also built as described in the kraken Install document.
- Jellyfish version 1[4] is required by Kraken and was also installed.
- accuracy example datasets[5] used in the Kraken paper were copied to the local disk.
- Krona[6]was installed from the separate developer site[7] to print the kraken results as pie charts.
Example of running Kraken on simBA5
# run against minikraken mkdir simBA5_minikraken kraken --threads 12 --preload \ --db $KRAKENDB/minikraken simBA5_accuracy.fa \ --fasta-input > simBA5_minikraken/simBA5_accuracy.minikraken # run against the standard kraken DB mkdir simBA5_kraken-std kraken --threads 12 --preload \ --db $KRAKENDB/standard simBA5_accuracy.fa \ --fasta-input > simBA5_kraken-std/simBA5_accuracy.kraken
Converting the Kraken results to a live-interactive dendrogram with Krona
# run for minikraken results # Get the name of each sequence (fasta header) and the assigned taxonomy ID using the output from step 1 cut -f2,3 simBA5_minikraken/simBA5_accuracy.minikraken \ > simBA5_minikraken/simBA5_accuracy.minikraken.krona.in # run krona on the extraction ImportTaxonomy.pl simBA5_accuracy.minikraken_krona.in -o simBA5_accuracy.minikraken_krona.html # run for kraken-standard results # Get the name of each sequence (fasta header) and the assigned taxonomy ID using the output from step 1 cut -f2,3 simBA5_kraken-std/simBA5_accuracy.kraken \ > simBA5_kraken-std/simBA5_accuracy.kraken_krona.in # run krona on the extraction ImportTaxonomy.pl simBA5_accuracy.kraken_krona.in -o simBA5_accuracy.kraken_krona.html
you can avoid the cut command by running ImportTaxonomy.pl with addition of two parameters '-q 2 -t 3' to specify the right columns for data
To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 4
Access here the live Krona page (also from the links at the bottom of this page)
Access here the live Krona page (also from the links at the bottom of this page)
Run the RTG composition-meta-pipeline on simBA5
[8]Requirements
- The RTG core code [9] and the pre-compiled RTG species database [10] were installed on our server (our institute qualifies for the free version, please check for your own access restrictions).
The meta-pipeline command
Although RTG[11] can do much more than this application, it also comes with a all-in-one meta command to perform the metagenomics quantification from NGS reads as described below.
# run against RTG species and filter databases # first convert the fasta query to a RTG SDF database rtg format --format=fasta -o simBA5_accuracy.sdf simBA5_accuracy.fa # then run the meta-pipeline rtg composition-meta-pipeline --input=simBA5_accuracy.sdf \ --output=simBA5_accuracy.rtg \ --filter=$RTG_INDEXES/filter \ --species=$RTG_INDEXES/species # the pipeline consists in several steps reproduced here from the stdout: #1 (contaminant removal- e.g. human reads in microbiome samples) ## rtg mapf --output simBA5_accuracy.rtg/mapf \ ## --template /data2/biodata/rtg_indexes/filter \ ## --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id \ ## --input simBA5_accuracy.sdf READ MAPPINGS 0 0.0% mapped 77 0.8% unmapped with poor hits (XC = D) 9923 99.2% unmapped with no hits 10000 100.0% total #2 (reference alignment to nucleotide sequences) ## rtg map --output simBA5_accuracy.rtg/map \ ## --template /data2/biodata/rtg_indexes/species \ ## --max-mismatches 10% \ ## --max-top-results 100 \ ## --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id \ ## --input simBA5_accuracy.rtg/mapf/unmapped.sdf READ MAPPINGS 4769 47.7% mapped uniquely (NH = 1) 2989 29.9% mapped ambiguously (NH > 1) 0 0.0% unmapped due to read frequency (XC = B) 232 2.3% unmapped with too many hits (XC = C) 1511 15.1% unmapped with poor hits (XC = D) 0 0.0% unmapped with too many good hits (XC = E) 499 5.0% unmapped with no hits 10000 100.0% total #3 (produce abundance estimates, genome coverage statistics, and diversity metrics) ## rtg species --output simBA5_accuracy.rtg/species \ ## --genomes /data2/biodata/rtg_indexes/species \ ## simBA5_accuracy.rtg/map/alignments.bam Diversity metrics Richness: 0.0000 Shannon: 0.0000 Pielou: -0.0000 Inverse Simpson: Infinity # finally, review the results in the report folder (and more in other folders) firefox simBA5_accuracy.rtg/report/speciesReport/index.html
To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 5 for both 'abundance' and 'fraction' results
Access here the live RTG-Krona page (also from the links at the bottom of this page)
RTG composition-meta-pipeline manpage
command arguments
rtg composition-meta-pipeline --help Usage: rtg composition-meta-pipeline [OPTION]... --output DIR --input SDF|FILE [OPTION]... --output DIR --input-left FILE --input-right FILE Runs a metagenomic composition pipeline. The pipeline consists of read filtering, read alignment, then species composition. File Input/Output --filter=SDF filter reference sequence (Default is /opt/biotools/rtg-core-non-commercial-3.7.1/references/filter) --input=SDF|FILE input SDF or single end fastq input --input-left=FILE left arm of paired end FASTQ input --input-right=FILE right arm of paired end FASTQ input --output=DIR output directory --platform=STRING platform of the input data. Allowed values are [illumina, iontorrent] (Default is illumina) --species=SDF SDF containing reference sequences (Default is /opt/biotools/rtg-core-non-commercial-3.7.1/references/species) Utility -h, --help print help on command-line flag usage
Results for the Kraken MiSeq and HiSeq compiled datasets
Similarly, two other datasets were created in the Kraken paper, that mimic MiSeq and HiSeq data. We ran these two using kraken-standard and RTG composition-meta-pipeline.
Results for the MiSeq merged data
# run against the standard kraken DB mkdir MiSeq_kraken-std kraken --threads 12 --preload \ --db $KRAKENDB/standard MiSeq_accuracy.fa \ --fasta-input > MiSeq_kraken-std/MiSeq_accuracy.kraken Loading database... complete. 10000 sequences (1.56 Mbp) processed in 0.364s (1650.6 Kseq/m, 257.87 Mbp/m). 8002 sequences classified (80.02%) 1998 sequences unclassified (19.98%) cut -f2,3 MiSeq_kraken-std/MiSeq_accuracy.kraken \ > MiSeq_kraken-std/MiSeq_accuracy.kraken.krona.in # run krona on the extraction ImportTaxonomy.pl MiSeq_kraken-std/MiSeq_accuracy.kraken.krona.in \ -o MiSeq_kraken-std/MiSeq_accuracy.kraken.krona.html
To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 4
Access here the live Krona page (also from the links at the bottom of this page)
rtg composition-meta-pipeline --input=MiSeq_accuracy.sdf --output=MiSeq_accuracy.rtg --filter=$RTG_INDEXES/filter --species=$RTG_INDEXES/species ## rtg mapf --output MiSeq_accuracy.rtg/mapf --template /data2/biodata/rtg_indexes/filter --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input MiSeq_accuracy.sdf READ MAPPINGS 0 0.0% mapped 127 1.3% unmapped with poor hits (XC = D) 9873 98.7% unmapped with no hits 10000 100.0% total ## rtg map --output MiSeq_accuracy.rtg/map --template /data2/biodata/rtg_indexes/species --max-mismatches 10% --max-top-results 100 --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input MiSeq_accuracy.rtg/mapf/unmapped.sdf READ MAPPINGS 3949 39.5% mapped uniquely (NH = 1) 4511 45.1% mapped ambiguously (NH > 1) 0 0.0% unmapped due to read frequency (XC = B) 75 0.8% unmapped with too many hits (XC = C) 487 4.9% unmapped with poor hits (XC = D) 0 0.0% unmapped with too many good hits (XC = E) 978 9.8% unmapped with no hits 10000 100.0% total ## rtg species --output MiSeq_accuracy.rtg/species --genomes /data2/biodata/rtg_indexes/species MiSeq_accuracy.rtg/map/alignments.bam Diversity metrics Richness: 15.000 Shannon: 1.4116 Pielou: 0.52128 Inverse Simpson: 42.790
To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 5 for both 'abundance' and 'fraction' results
Access here the live RTG-Krona page (also from the links at the bottom of this page)
Results for the HiSeq merged data
# run against the standard kraken DB mkdir HiSeq_kraken-std kraken --threads 12 --preload \ --db $KRAKENDB/standard HiSeq_accuracy.fa \ --fasta-input > HiSeq_kraken-std/HiSeq_accuracy.kraken Loading database... complete. 10000 sequences (0.92 Mbp) processed in 0.239s (2513.2 Kseq/m, 231.83 Mbp/m). 7891 sequences classified (78.91%) 2109 sequences unclassified (21.09%) cut -f2,3 HiSeq_kraken-std/HiSeq_accuracy.kraken \ > HiSeq_kraken-std/HiSeq_accuracy.kraken.krona.in # run krona on the extraction ImportTaxonomy.pl HiSeq_kraken-std/HiSeq_accuracy.kraken.krona.in \ -o HiSeq_kraken-std/HiSeq_accuracy.kraken.krona.html
To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 4
Access here the live Krona page (also from the links at the bottom of this page)
rtg composition-meta-pipeline --input=HiSeq_accuracy.sdf --output=HiSeq_accuracy.rtg \ > --filter=$RTG_INDEXES/filter \ > --species=$RTG_INDEXES/species ## rtg mapf --output HiSeq_accuracy.rtg/mapf --template /data2/biodata/rtg_indexes/filter --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input HiSeq_accuracy.sdf READ MAPPINGS 1 0.0% mapped 86 0.9% unmapped with poor hits (XC = D) 9913 99.1% unmapped with no hits 10000 100.0% total ## rtg map --output HiSeq_accuracy.rtg/map --template /data2/biodata/rtg_indexes/species --max-mismatches 10% --max-top-results 100 --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input HiSeq_accuracy.rtg/mapf/unmapped.sdf READ MAPPINGS 2315 23.2% mapped uniquely (NH = 1) 6732 67.3% mapped ambiguously (NH > 1) 0 0.0% unmapped due to read frequency (XC = B) 54 0.5% unmapped with too many hits (XC = C) 237 2.4% unmapped with poor hits (XC = D) 0 0.0% unmapped with too many good hits (XC = E) 661 6.6% unmapped with no hits 9999 100.0% total ## rtg species --output HiSeq_accuracy.rtg/species --genomes /data2/biodata/rtg_indexes/species HiSeq_accuracy.rtg/map/alignments.bam Diversity metrics Richness: 11.000 Shannon: 1.1145 Pielou: 0.46479 Inverse Simpson: 48.422
To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 5 for both 'abundance' and 'fraction' results
Access here the live RTG-Krona page (also from the links at the bottom of this page)
Kranen performance with real data
A real NextSeq dataset was chosen to measure runtime for the Kraken commands. We used two read types pre-processed by the Illumina barcode pipeline. For both datasets, the Kraken tool was run without preloading the reference as this was already present in RAM from previous searches. Within each dataset, we used either read#1 or read#1+read#2 (paired) in order to evaluate the added value of using paired read data.
In case of paired input, kraken merges the two read sequences with addition of one 'n'-base in order to mimic a long-reads
Undetermined reads
We first look at 'Undetermined' reads that did not match any barcode (these include the Phi-X spike-in reads). The first Lane was chosen arbitrarily.
Single read file but not preloading (done before and still in RAM)
time kraken --threads 12 \ --db $KRAKENDB/standard \ --fastq-input \ --gzip-compressed \ Undetermined_S0_L001_R1_001.fastq.gz \ > Undetermined_S0_L001_R1_001_noload.kraken 1098052 sequences (164.57 Mbp) processed in 8.122s (8112.1 Kseq/m, 1215.77 Mbp/m). 792816 sequences classified (72.20%) 305236 sequences unclassified (27.80%) real 0m10.348s user 1m21.067s sys 0m7.383s
Followed by Krona formatting
time ImportTaxonomy.pl \ -q 2 -t 3 \ Undetermined_S0_L001_R1_001_noload.kraken \ -o Undetermined_S0_L001_R1_001_noload.kraken.html [ WARNING ] Score column already in use; not reading scores. Loading taxonomy... Importing Undetermined_S0_L001_R1_001_noload.kraken... Writing Undetermined_S0_L001_R1_001_noload.kraken.html... [ WARNING ] Too many query IDs to store in chart; storing supplemental files in 'Undetermined_S0_L001_R1_001_noload.kraken.html.files'. real 0m37.958s user 0m37.522s sys 0m0.434s
krona results
Access here the live Krona page (also from the links at the bottom of this page)
Paired read files and no preloading
time kraken --threads 12 \ --db $KRAKENDB/standard \ --fastq-input \ --paired \ --gzip-compressed \ Undetermined_S0_L001_R1_001.fastq.gz \ Undetermined_S0_L001_R2_001.fastq.gz \ > Undetermined_S0_L001_paired_noload.kraken 1098052 sequences (330.09 Mbp) processed in 32.683s (2015.8 Kseq/m, 606.00 Mbp/m). 812547 sequences classified (74.00%) 285505 sequences unclassified (26.00%) real 0m34.948s user 2m19.602s sys 0m10.071s
Followed by Krona formatting
time ImportTaxonomy.pl \ -q 2 -t 3 \ Undetermined_S0_L001_paired_noload.kraken \ -o Undetermined_S0_L001_paired_noload.kraken.html [ WARNING ] Score column already in use; not reading scores. Loading taxonomy... Importing Undetermined_S0_L001_paired_noload.kraken... Writing Undetermined_S0_L001_paired_noload.kraken.html... [ WARNING ] Too many query IDs to store in chart; storing supplemental files in 'Undetermined_S0_L001_paired_noload.kraken.html.files'. real 0m38.824s user 0m38.211s sys 0m0.611s
krona results
Access here the live Krona page (also from the links at the bottom of this page)
barcode sorted reads
The first Lane (labelled 9-11) was chosen arbitrarily. Many more reads are present as compared to the initial 'Undetermined' dataset. The results obtained here can be compared to published usual suspect[12] like for instance Bradyrhizobium found below to 0.8-0.9% of the Bacterial species (0.01-0.02% of total reads).
Single read file and not preloading the database
time kraken \ --threads 12 \ --db $KRAKENDB/standard \ --fastq-input \ --gzip-compressed \ 911_S4_L001_R1_001.fastq.gz \ > 911_S4_L001_R1_001_noload.kraken 7849590 sequences (1170.85 Mbp) processed in 58.334s (8073.8 Kseq/m, 1204.30 Mbp/m). 138975 sequences classified (1.77%) 7710615 sequences unclassified (98.23%) real 1m2.461s user 10m1.443s sys 0m51.017s
Followed by Krona formating
time ImportTaxonomy.pl \ -q 2 -t 3 \ 911_S4_L001_R1_001_noload.kraken \ -o 911_S4_L001_R1_001_noload.kraken.html [ WARNING ] Score column already in use; not reading scores. Loading taxonomy... Importing 911_S4_L001_R1_001_noload.kraken... Writing 911_S4_L001_R1_001_noload.kraken.html... [ WARNING ] Too many query IDs to store in chart; storing supplemental files in '911_S4_L001_R1_001_noload.kraken.html.files'. real 1m29.650s user 1m28.036s sys 0m1.608s
krona results
Focussing on Bacteria allows digging into the other species as shown in the second picture.
Access here the live Krona page (also from the links at the bottom of this page)
paired read file and not preloading the database
time kraken --threads 12 \ --db $KRAKENDB/standard \ --fastq-input \ --gzip-compressed \ --paired \ 911_S4_L001_R1_001.fastq.gz \ 911_S4_L001_R2_001.fastq.gz \ > 911_S4_L001_paired_noload.fastq.kraken 7849590 sequences (2349.27 Mbp) processed in 231.315s (2036.1 Kseq/m, 609.37 Mbp/m). 253067 sequences classified (3.22%) 7596523 sequences unclassified (96.78%) real 3m55.151s user 21m3.505s sys 1m41.701s
Followed by Krona formating
time ImportTaxonomy.pl \ -q 2 -t 3 \ 911_S4_L001_paired_noload.fastq.kraken \ -o 911_S4_L001_paired_noload.fastq.kraken.html [ WARNING ] Score column already in use; not reading scores. Loading taxonomy... Importing 911_S4_L001_paired_noload.fastq.kraken... Writing 911_S4_L001_paired_noload.fastq.kraken.html... [ WARNING ] Too many query IDs to store in chart; storing supplemental files in '911_S4_L001_paired_noload.fastq.kraken.html.files'. real 1m34.377s user 1m32.630s sys 0m1.739s
krona results
Focussing on Bacteria allows digging into the other species as shown in the second picture.
Access here the live Krona page (also from the links at the bottom of this page)
References:
- ↑ 1.0 1.1 1.2 1.3
Derrick E Wood, Steven L Salzberg
Kraken: ultrafast metagenomic sequence classification using exact alignments.
Genome Biol: 2014, 15(3);R46
[PubMed:24580807] ##WORLDCAT## [DOI] (I e) - ↑ http://ccb.jhu.edu/software/kraken/
- ↑ http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
- ↑ http://www.cbcb.umd.edu/software/jellyfish/
- ↑ http://ccb.jhu.edu/software/kraken/dl/accuracy.tgz
- ↑
Brian D Ondov, Nicholas H Bergman, Adam M Phillippy
Interactive metagenomic visualization in a Web browser.
BMC Bioinformatics: 2011, 12;385
[PubMed:21961884] ##WORLDCAT## [DOI] (I e) - ↑ https://github.com/marbl/Krona/wiki
- ↑ http://realtimegenomics.com/
- ↑ http://realtimegenomics.com/products/rtg-core/
- ↑ http://www.realtimegenomics.com/news/pre-formatted-reference-datasets/
- ↑
John G Cleary, Ross Braithwaite, Kurt Gaastra, Brian S Hilbush, Stuart Inglis, Sean A Irvine, Alan Jackson, Richard Littin, Sahar Nohzadeh-Malakshah, Mehul Rathod, David Ware, Len Trigg, Francisco M De La Vega
Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data.
J Comput Biol: 2014, 21(6);405-19
[PubMed:24874280] ##WORLDCAT## [DOI] (I p) - ↑
Martin Laurence, Christos Hatzis, Douglas E Brash
Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes.
PLoS One: 2014, 9(5);e97876
[PubMed:24837716] ##WORLDCAT## [DOI] (I e)
Download data files
Download data files here
[ Main_Page ]