Analyse and quantify Metagenomic sequencing data

From BITS wiki
Jump to: navigation, search

Identify and quantify 'non-human / metagenomics' reads in NGS data

[ Main_Page ]

Introduction

Non Reference reads present in NGS data may come from sample or library prep contamination but may also be biologically relevant in case of sample mixture resulting from a metagenomic study, symbiotic relationship between a host and its guests, or samples from infected species. A popular tool to identify such non-Reference reads is Kraken [1] which makes use of a sequence database built from subsets of the NCBI data. RealTime Genomics proposes a similar pipeline presented in this white paper and that can be run without costs for Academic use.

The aim of this white paper is to introduce both methods and present the results of analysing a defined test datasets (from Kraken paper).

Repeating this work will require downloading and installing the Kraken and RTG software as well as their dependencies and reference databases.

Kraken-provided merged and synthetic datasets

The Kraken paper[1] used public data from MiSeq and HiSeq as well as synthetic data to evaluate the performance of the software. We borrowed these datasets, and the quoted sentences below, to evaluate RTG on the same data. Please refer to the Material and Methods section of the Kraken paper for more details.

  • MiSeq and HiSeq data: <<The HiSeq and MiSeq metagenomes were built using 20 sets of bacterial whole-genome shotgun reads. These reads were found either as part of the GAGE-B project or in the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes. For both the 10,000 and 10 million read samples of each of these metagenomes, 10% of their sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance). All sequences were trimmed to remove low quality bases and adapter sequences. >>[1]

The names associated with the strains of the 20 genomes are:

MiSeq:
  - Bacillus cereus VD118
  - Citrobacter freundii 47N
  - Enterobacter cloacae
  - Klebsiella pneumoniae NES14
  - Mycobacterium abscessus 6G-0125-R
  - Proteus vulgaris 66N
  - Rhodobacter sphaeroides 2.4.1
  - Staphylococcus aureus ST22
  - Salmonella enterica Montevideo str. N19965
  - Vibrio cholerae CP1032(5)

HiSeq:
  - Aeromonas hydrophila SSU
  - Bacillus cereus VD118
  - Bacteroides fragilis HMW615
  - Mycobacterium abscessus 6G-0125-R
  - Pelosinus fermentans A11
  - Rhodobacter sphaeroides 2.4.1
  - Staphylococcus aureus M0927
  - Streptococcus pneumoniae TIGR4
  - Vibrio cholerae CP1032(5)
  - Xanthomonas axonopodis pv. Manihotis UA323
  • simBA5 data: << This sample, featuring simulated bacterial and archaeal reads (called simBA-5), was created with an error rate five times higher than would be expected, to evaluate Kraken’s performance on data that contain many errors or have strong differences from Kraken’s genomic library >>[1]

Run Kraken with the provided 'accuracy' datasets

Requirements

  • Kraken was installed[2] together with the minimal database minikraken[3], the standard database was also built as described in the kraken Install document.
  • Jellyfish version 1[4] is required by Kraken and was also installed.
  • accuracy example datasets[5] used in the Kraken paper were copied to the local disk.
  • Krona[6]was installed from the separate developer site[7] to print the kraken results as pie charts.

Example of running Kraken on simBA5

# run against minikraken
mkdir simBA5_minikraken
 
kraken --threads 12 --preload \
   --db $KRAKENDB/minikraken simBA5_accuracy.fa \
   --fasta-input > simBA5_minikraken/simBA5_accuracy.minikraken
 
# run against the standard kraken DB
mkdir simBA5_kraken-std
 
kraken --threads 12 --preload \
   --db $KRAKENDB/standard simBA5_accuracy.fa \
   --fasta-input > simBA5_kraken-std/simBA5_accuracy.kraken

Converting the Kraken results to a live-interactive dendrogram with Krona

# run for minikraken results
# Get the name of each sequence (fasta header) and the assigned taxonomy ID using the output from step 1
cut -f2,3  simBA5_minikraken/simBA5_accuracy.minikraken \
	> simBA5_minikraken/simBA5_accuracy.minikraken.krona.in
 
# run krona on the extraction
ImportTaxonomy.pl simBA5_accuracy.minikraken_krona.in -o simBA5_accuracy.minikraken_krona.html
 
# run for kraken-standard results
# Get the name of each sequence (fasta header) and the assigned taxonomy ID using the output from step 1
cut -f2,3  simBA5_kraken-std/simBA5_accuracy.kraken \
	> simBA5_kraken-std/simBA5_accuracy.kraken_krona.in
 
# run krona on the extraction
ImportTaxonomy.pl simBA5_accuracy.kraken_krona.in -o simBA5_accuracy.kraken_krona.html

Handicon.png you can avoid the cut command by running ImportTaxonomy.pl with addition of two parameters '-q 2 -t 3' to specify the right columns for data

Handicon.png To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 4


minikraken_simBA5.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)


kraken.std_simBA5.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)

Run the RTG composition-meta-pipeline on simBA5

rtg_logo.png
[8]

Requirements

  • The RTG core code [9] and the pre-compiled RTG species database [10] were installed on our server (our institute qualifies for the free version, please check for your own access restrictions).

The meta-pipeline command

Although RTG[11] can do much more than this application, it also comes with a all-in-one meta command to perform the metagenomics quantification from NGS reads as described below.

Technical.png RTG comes with code to generate Krona html files out of the box

# run against RTG species and filter databases
 
# first convert the fasta query to a RTG SDF database
rtg format --format=fasta -o simBA5_accuracy.sdf simBA5_accuracy.fa
 
# then run the meta-pipeline
rtg composition-meta-pipeline --input=simBA5_accuracy.sdf \
	--output=simBA5_accuracy.rtg \
	--filter=$RTG_INDEXES/filter \
	--species=$RTG_INDEXES/species
 
# the pipeline consists in several steps reproduced here from the stdout:
 
#1 (contaminant removal- e.g. human reads in microbiome samples)
## rtg mapf  --output simBA5_accuracy.rtg/mapf \
##	--template /data2/biodata/rtg_indexes/filter \
##	--sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id \
##	--input simBA5_accuracy.sdf
 
READ MAPPINGS
    0   0.0% mapped
   77   0.8% unmapped with poor hits (XC = D)
 9923  99.2% unmapped with no hits
10000 100.0% total
 
#2 (reference alignment to nucleotide sequences)
## rtg map  --output simBA5_accuracy.rtg/map \
##	--template /data2/biodata/rtg_indexes/species \
##	--max-mismatches 10% \
##	--max-top-results 100 \
##	--sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id \
##	--input simBA5_accuracy.rtg/mapf/unmapped.sdf
 
READ MAPPINGS
 4769  47.7% mapped uniquely (NH = 1)
 2989  29.9% mapped ambiguously (NH > 1)
    0   0.0% unmapped due to read frequency (XC = B)
  232   2.3% unmapped with too many hits (XC = C)
 1511  15.1% unmapped with poor hits (XC = D)
    0   0.0% unmapped with too many good hits (XC = E)
  499   5.0% unmapped with no hits
10000 100.0% total
 
#3 (produce abundance estimates, genome coverage statistics, and diversity metrics)
## rtg species  --output simBA5_accuracy.rtg/species \
##	--genomes /data2/biodata/rtg_indexes/species \
##	simBA5_accuracy.rtg/map/alignments.bam
 
Diversity metrics
       Richness:    0.0000
        Shannon:    0.0000
         Pielou:   -0.0000
Inverse Simpson:  Infinity
 
# finally, review the results in the report folder (and more in other folders)
firefox simBA5_accuracy.rtg/report/speciesReport/index.html

Handicon.png To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 5 for both 'abundance' and 'fraction' results


RTG_simBA5-abund.png


RTG_simBA5-frac.png

Handicon.png Access here the live RTG-Krona page (also from the links at the bottom of this page)

RTG composition-meta-pipeline manpage

command arguments

rtg composition-meta-pipeline --help
Usage: rtg composition-meta-pipeline [OPTION]... --output DIR --input SDF|FILE
                                     [OPTION]... --output DIR --input-left FILE --input-right FILE
 
Runs a metagenomic composition pipeline. 
The pipeline consists of read filtering, read alignment, then species composition.
 
File Input/Output
      --filter=SDF       filter reference sequence 
         (Default is /opt/biotools/rtg-core-non-commercial-3.7.1/references/filter)
      --input=SDF|FILE   input SDF or single end fastq input
      --input-left=FILE  left arm of paired end FASTQ input
      --input-right=FILE right arm of paired end FASTQ input
      --output=DIR       output directory
      --platform=STRING  platform of the input data. 
         Allowed values are [illumina, iontorrent] (Default is illumina)
      --species=SDF      SDF containing reference sequences 
           (Default is /opt/biotools/rtg-core-non-commercial-3.7.1/references/species)
 
Utility
  -h, --help             print help on command-line flag usage

Results for the Kraken MiSeq and HiSeq compiled datasets

Similarly, two other datasets were created in the Kraken paper, that mimic MiSeq and HiSeq data. We ran these two using kraken-standard and RTG composition-meta-pipeline.

Results for the MiSeq merged data

# run against the standard kraken DB
mkdir MiSeq_kraken-std
 
kraken --threads 12 --preload \
   --db $KRAKENDB/standard MiSeq_accuracy.fa \
   --fasta-input > MiSeq_kraken-std/MiSeq_accuracy.kraken
 
Loading database... complete.
10000 sequences (1.56 Mbp) processed in 0.364s (1650.6 Kseq/m, 257.87 Mbp/m).
  8002 sequences classified (80.02%)
  1998 sequences unclassified (19.98%)
 
cut -f2,3  MiSeq_kraken-std/MiSeq_accuracy.kraken \
	> MiSeq_kraken-std/MiSeq_accuracy.kraken.krona.in
 
# run krona on the extraction
ImportTaxonomy.pl MiSeq_kraken-std/MiSeq_accuracy.kraken.krona.in \
	-o MiSeq_kraken-std/MiSeq_accuracy.kraken.krona.html

Handicon.png To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 4


MiSeq_kraken-std.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)

rtg composition-meta-pipeline --input=MiSeq_accuracy.sdf --output=MiSeq_accuracy.rtg --filter=$RTG_INDEXES/filter --species=$RTG_INDEXES/species
## rtg mapf  --output MiSeq_accuracy.rtg/mapf --template /data2/biodata/rtg_indexes/filter --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input MiSeq_accuracy.sdf
 
READ MAPPINGS
    0   0.0% mapped
  127   1.3% unmapped with poor hits (XC = D)
 9873  98.7% unmapped with no hits
10000 100.0% total
## rtg map  --output MiSeq_accuracy.rtg/map --template /data2/biodata/rtg_indexes/species --max-mismatches 10% --max-top-results 100 --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input MiSeq_accuracy.rtg/mapf/unmapped.sdf
 
READ MAPPINGS
 3949  39.5% mapped uniquely (NH = 1)
 4511  45.1% mapped ambiguously (NH > 1)
    0   0.0% unmapped due to read frequency (XC = B)
   75   0.8% unmapped with too many hits (XC = C)
  487   4.9% unmapped with poor hits (XC = D)
    0   0.0% unmapped with too many good hits (XC = E)
  978   9.8% unmapped with no hits
10000 100.0% total
## rtg species  --output MiSeq_accuracy.rtg/species --genomes /data2/biodata/rtg_indexes/species MiSeq_accuracy.rtg/map/alignments.bam
Diversity metrics
       Richness:   15.000
        Shannon:   1.4116
         Pielou:  0.52128
Inverse Simpson:   42.790

Handicon.png To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 5 for both 'abundance' and 'fraction' results


RTG_MiSeq-abund.png


RTG_MiSeq-frac.png

Handicon.png Access here the live RTG-Krona page (also from the links at the bottom of this page)

Results for the HiSeq merged data

# run against the standard kraken DB
mkdir HiSeq_kraken-std
 
kraken --threads 12 --preload \
   --db $KRAKENDB/standard HiSeq_accuracy.fa \
   --fasta-input > HiSeq_kraken-std/HiSeq_accuracy.kraken
 
Loading database... complete.
10000 sequences (0.92 Mbp) processed in 0.239s (2513.2 Kseq/m, 231.83 Mbp/m).
  7891 sequences classified (78.91%)
  2109 sequences unclassified (21.09%)
 
cut -f2,3  HiSeq_kraken-std/HiSeq_accuracy.kraken \
	> HiSeq_kraken-std/HiSeq_accuracy.kraken.krona.in
 
# run krona on the extraction
ImportTaxonomy.pl HiSeq_kraken-std/HiSeq_accuracy.kraken.krona.in \
	-o HiSeq_kraken-std/HiSeq_accuracy.kraken.krona.html

Handicon.png To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 4


HiSeq_kraken-std.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)

rtg composition-meta-pipeline --input=HiSeq_accuracy.sdf --output=HiSeq_accuracy.rtg \
> --filter=$RTG_INDEXES/filter \
> --species=$RTG_INDEXES/species
## rtg mapf  --output HiSeq_accuracy.rtg/mapf --template /data2/biodata/rtg_indexes/filter --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input HiSeq_accuracy.sdf
 
READ MAPPINGS
    1   0.0% mapped
   86   0.9% unmapped with poor hits (XC = D)
 9913  99.1% unmapped with no hits
10000 100.0% total
## rtg map  --output HiSeq_accuracy.rtg/map --template /data2/biodata/rtg_indexes/species --max-mismatches 10% --max-top-results 100 --sam-rg @RG\tPL:ILLUMINA\tSM:sample\tID:id --input HiSeq_accuracy.rtg/mapf/unmapped.sdf
 
READ MAPPINGS
2315  23.2% mapped uniquely (NH = 1)
6732  67.3% mapped ambiguously (NH > 1)
   0   0.0% unmapped due to read frequency (XC = B)
  54   0.5% unmapped with too many hits (XC = C)
 237   2.4% unmapped with poor hits (XC = D)
   0   0.0% unmapped with too many good hits (XC = E)
 661   6.6% unmapped with no hits
9999 100.0% total
 
## rtg species  --output HiSeq_accuracy.rtg/species --genomes /data2/biodata/rtg_indexes/species HiSeq_accuracy.rtg/map/alignments.bam
Diversity metrics
       Richness:   11.000
        Shannon:   1.1145
         Pielou:  0.46479
Inverse Simpson:   48.422

Handicon.png To simplify the output, only the Bacteria superkingdom is shown in the following pictures starting at level 5 for both 'abundance' and 'fraction' results


RTG_HiSeq-abund.png


RTG_HiSeq-frac.png

Handicon.png Access here the live RTG-Krona page (also from the links at the bottom of this page)


Kranen performance with real data

A real NextSeq dataset was chosen to measure runtime for the Kraken commands. We used two read types pre-processed by the Illumina barcode pipeline. For both datasets, the Kraken tool was run without preloading the reference as this was already present in RAM from previous searches. Within each dataset, we used either read#1 or read#1+read#2 (paired) in order to evaluate the added value of using paired read data.

Technical.png In case of paired input, kraken merges the two read sequences with addition of one 'n'-base in order to mimic a long-reads

Undetermined reads

We first look at 'Undetermined' reads that did not match any barcode (these include the Phi-X spike-in reads). The first Lane was chosen arbitrarily.

Single read file but not preloading (done before and still in RAM)

time kraken --threads 12 \
	--db $KRAKENDB/standard \
	--fastq-input \
	--gzip-compressed \
	Undetermined_S0_L001_R1_001.fastq.gz \
	> Undetermined_S0_L001_R1_001_noload.kraken
 
1098052 sequences (164.57 Mbp) processed in 8.122s (8112.1 Kseq/m, 1215.77 Mbp/m).
  792816 sequences classified (72.20%)
  305236 sequences unclassified (27.80%)
 
real	0m10.348s
user	1m21.067s
sys	0m7.383s

Followed by Krona formatting

time ImportTaxonomy.pl \
	-q 2 -t 3 \
	 Undetermined_S0_L001_R1_001_noload.kraken \
	-o Undetermined_S0_L001_R1_001_noload.kraken.html
 
   [ WARNING ]  Score column already in use; not reading scores.
Loading taxonomy...
Importing Undetermined_S0_L001_R1_001_noload.kraken...
Writing Undetermined_S0_L001_R1_001_noload.kraken.html...
   [ WARNING ]  Too many query IDs to store in chart; storing supplemental files in 'Undetermined_S0_L001_R1_001_noload.kraken.html.files'.
 
real	0m37.958s
user	0m37.522s
sys	0m0.434s

krona results

NC-Illumina-all_single-noload.png


NC-Illumina-bacteria_single-noload.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)

Paired read files and no preloading

time kraken --threads 12 \
	--db $KRAKENDB/standard \
	--fastq-input \
	--paired \
	--gzip-compressed \
	Undetermined_S0_L001_R1_001.fastq.gz \
	Undetermined_S0_L001_R2_001.fastq.gz \
	> Undetermined_S0_L001_paired_noload.kraken
 
1098052 sequences (330.09 Mbp) processed in 32.683s (2015.8 Kseq/m, 606.00 Mbp/m).
  812547 sequences classified (74.00%)
  285505 sequences unclassified (26.00%)
 
real	0m34.948s
user	2m19.602s
sys	0m10.071s

Followed by Krona formatting

time ImportTaxonomy.pl \
	-q 2 -t 3 \
	Undetermined_S0_L001_paired_noload.kraken \
	-o Undetermined_S0_L001_paired_noload.kraken.html
 
   [ WARNING ]  Score column already in use; not reading scores.
Loading taxonomy...
Importing Undetermined_S0_L001_paired_noload.kraken...
Writing Undetermined_S0_L001_paired_noload.kraken.html...
   [ WARNING ]  Too many query IDs to store in chart; storing supplemental files in 'Undetermined_S0_L001_paired_noload.kraken.html.files'.
 
real	0m38.824s
user	0m38.211s
sys	0m0.611s

krona results

NC-Illumina-all_paired-noload.png


NC-Illumina-bacteria_paired-noload.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)

barcode sorted reads

The first Lane (labelled 9-11) was chosen arbitrarily. Many more reads are present as compared to the initial 'Undetermined' dataset. The results obtained here can be compared to published usual suspect[12] like for instance Bradyrhizobium found below to 0.8-0.9% of the Bacterial species (0.01-0.02% of total reads).

Single read file and not preloading the database

time kraken \
	--threads 12 \
	--db $KRAKENDB/standard \
	--fastq-input \
	--gzip-compressed \
	 911_S4_L001_R1_001.fastq.gz \
	 > 911_S4_L001_R1_001_noload.kraken
 
7849590 sequences (1170.85 Mbp) processed in 58.334s (8073.8 Kseq/m, 1204.30 Mbp/m).
  138975 sequences classified (1.77%)
  7710615 sequences unclassified (98.23%)
 
real	1m2.461s
user	10m1.443s
sys	0m51.017s

Followed by Krona formating

time ImportTaxonomy.pl \
	-q 2 -t 3 \
	 911_S4_L001_R1_001_noload.kraken \
	-o 911_S4_L001_R1_001_noload.kraken.html
 
   [ WARNING ]  Score column already in use; not reading scores.
Loading taxonomy...
Importing 911_S4_L001_R1_001_noload.kraken...
Writing 911_S4_L001_R1_001_noload.kraken.html...
   [ WARNING ]  Too many query IDs to store in chart; storing supplemental files in '911_S4_L001_R1_001_noload.kraken.html.files'.
 
real	1m29.650s
user	1m28.036s
sys	0m1.608s

krona results

Focussing on Bacteria allows digging into the other species as shown in the second picture.


NC-Illumina-911-all_single-noload.png


NC-Illumina-911-bacteria_single-noload.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)

paired read file and not preloading the database

time kraken --threads 12 \
	--db $KRAKENDB/standard \
	--fastq-input \
	--gzip-compressed \
	--paired \
	911_S4_L001_R1_001.fastq.gz \
	911_S4_L001_R2_001.fastq.gz \
	> 911_S4_L001_paired_noload.fastq.kraken
 
7849590 sequences (2349.27 Mbp) processed in 231.315s (2036.1 Kseq/m, 609.37 Mbp/m).
  253067 sequences classified (3.22%)
  7596523 sequences unclassified (96.78%)
 
real	3m55.151s
user	21m3.505s
sys	1m41.701s

Followed by Krona formating

time ImportTaxonomy.pl \
	-q 2 -t 3 \
	911_S4_L001_paired_noload.fastq.kraken \
	-o 911_S4_L001_paired_noload.fastq.kraken.html
 
   [ WARNING ]  Score column already in use; not reading scores.
Loading taxonomy...
Importing 911_S4_L001_paired_noload.fastq.kraken...
Writing 911_S4_L001_paired_noload.fastq.kraken.html...
   [ WARNING ]  Too many query IDs to store in chart; storing supplemental files in '911_S4_L001_paired_noload.fastq.kraken.html.files'.
 
real	1m34.377s
user	1m32.630s
sys	0m1.739s

krona results

Focussing on Bacteria allows digging into the other species as shown in the second picture.


NC-Illumina-911-all_paired-noload.png


NC-Illumina-911-bacteria_paired-noload.png

Handicon.png Access here the live Krona page (also from the links at the bottom of this page)



References:
  1. 1.0 1.1 1.2 1.3

    Derrick E Wood, Steven L Salzberg
    Kraken: ultrafast metagenomic sequence classification using exact alignments.
    Genome Biol: 2014, 15(3);R46
    [PubMed:24580807] ##WORLDCAT## [DOI] (I e)

  2. http://ccb.jhu.edu/software/kraken/
  3. http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
  4. http://www.cbcb.umd.edu/software/jellyfish/
  5. http://ccb.jhu.edu/software/kraken/dl/accuracy.tgz
  6. Brian D Ondov, Nicholas H Bergman, Adam M Phillippy
    Interactive metagenomic visualization in a Web browser.
    BMC Bioinformatics: 2011, 12;385
    [PubMed:21961884] ##WORLDCAT## [DOI] (I e)

  7. https://github.com/marbl/Krona/wiki
  8. http://realtimegenomics.com/
  9. http://realtimegenomics.com/products/rtg-core/
  10. http://www.realtimegenomics.com/news/pre-formatted-reference-datasets/
  11. John G Cleary, Ross Braithwaite, Kurt Gaastra, Brian S Hilbush, Stuart Inglis, Sean A Irvine, Alan Jackson, Richard Littin, Sahar Nohzadeh-Malakshah, Mehul Rathod, David Ware, Len Trigg, Francisco M De La Vega
    Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data.
    J Comput Biol: 2014, 21(6);405-19
    [PubMed:24874280] ##WORLDCAT## [DOI] (I p)

  12. Martin Laurence, Christos Hatzis, Douglas E Brash
    Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes.
    PLoS One: 2014, 9(5);e97876
    [PubMed:24837716] ##WORLDCAT## [DOI] (I e)

    
    

Download data files

Download data files here

Use the right application to open the files present in Metagenomics data files


[ Main_Page ]