Q&A added during the intro to NGS data analysis

From BITS wiki
Jump to: navigation, search

Please find here questions raised during the session with some element of answer found with Dr G

[ Main_Page | NGS_data_analysis ]


what are duplicates and what to do about them (an introduction!)

This is a central and still controversial question in NGS analysis. The answer depends on the aim of the experiment, the type of NGS applied, and the complexity of the data.

In short you can get optical duplicates and library duplicates in your NGS data.

  • optical duplicates (Illumina) are obtained when a single cluster of reads is part of two adjacent tiles' on the same slide and used to compute two read calls separately (see a nice slide in ’this presentation pdf). Optical duplicates are very similar in sequence (except sequencing errors). They can be identified where the first N bases are identical between two reads (N=50 if often used rather than N=read length since sequencing quality is known to decrease with position index in the read). They can be identified without alignment. You just check sequence and the coordinates on the image. Picard’s “MarkDuplicates.jar” can remove optical dups
  • library duplicates (aka PCR duplicates) are obtained when the original sample is pre-amplified to such extend that initial unique targets are PCR replicated prior to library preparation and will lead to several independent spots on the Illumina slide. PCR duplicates need therefore not be adjacent on the slide but may share a very high level of sequence identity. The level of library duplicates found after mapping all your reads will tell you how much your initial sample was saturating as compared to the genome complexity and width. PCR duplicates are usually identified after alignment.
    REM: as Heng Li states in this SeqAnswers post: "Dedup for single-end reads has high false positive rate given deep data. This is fine for SNP calling because the 40X coverage is enough for SNP calling. However, it is dangerous to dedup SE reads for RNA-seq or ChIP-seq where read count matters. As mrawlins has suggested, it would be better to account for duplicates in your read counting model rather than run a dedup program.". When working with paired-end reads, filtering reads that have identical mate mappings removes PCR dups.


what is the base system used in my FastQ data (could be public data)

There is no standalone tool outside of Galaxy to identify the ASSCI base used by Illumina in a fast file. If you do not work in Galaxy, you can manually determine the base (33 or 64) by scanning a number of reads and recording the lowest and highest quality score associated with base calls. The result compared to the Wikipedia FastQ table (found here and reproduced below) will provide the answer and allow setting the right Q-value (e.g. when using FastQC).



Nanopore sequencing

How much sample do I need for Nanopore ?

This is still hypothetical as the device was freshly added to the VIB arsenal and was not yet tested but the sample quality is expected very low as each single molecule is analyzed individually in a pore (please contact the VIB Nucleomics core facility[2] for more information).

Can I analyze variants with Nanopore ?

Long read technologies are not yet suited for variant analysis, only for assembly, Iso-Seq (sequencing full transcripts to capture transcription start, polyadenylation, and splice sites), copy number variations and structural variations.

Which software do I use to analyze Nanopore sequences ?

Best is to look at the Nanopore website for an up-to-date list of programs. The Nanopore tools are only available for registered people (customers). If you are a customer, you will get access to the Nanopore forum and pages where questions are nicely answered and progams links are available.

See this website for an overview of tools for analyzing nanopore data.

Other tools do work with Nanopore data: the BWA-mem mapper and the Canu assembler for instance work perfectly on Nanopore data when you adjust the settings (look at the man page).

How does Illumina sequencing works on the chip ?

a YouTube video showing how Illumina sequencing works: http://www.youtube.com/watch?v=Zqr8_KiuzHU and a PDF showing it in a more clear manner (scan through slides to #17) http://www.corning.com/uploadedFiles/Lifesciences/PDFs/Axygen_PDFs/NGS%20with%20Illumina%20GAII(3).pdf (please watch videos behind other links included in the provided slides!)


How can I install FASTQ groomer as standalone (without installing GALAXY on my computer) ?

Installing galaxy on a laptop IS possible and will not use any resource when not running (you can to start galaxy with a simple command under terminal before navigating to your browser and stop galaxy when you are done working with it). The installation process is very well documented on the Galaxy pages but of course you will need to upload your data to your local galaxy before using it.

If you still prefer to use only groomer as a standalone command line tool, this seems possible with some tweaking, please read this page for advice http://gmod.827538.n3.nabble.com/question-about-fastq-groomer-td1171208.html


How much coverage do I need to ensure good quality results ?

This is an open discussion and we cannot provide the answer but only report trends and common limits. The final limit is what you can pay for and is a compromise between the ideal situation and the affordable. Please discuss this issue before ordering with your provider. Remember that the accessible coverage depends on the amount of sample that you use for the making of the library and that over-amplifying a small sample will not make it better but only worse..

You should always be explicit about which type of coverage you are talking about.

  • coverage width is the expression of how much of the target is represented by reads (how many of the bases are sequenced in at least one read (a minimum limit of 5 would be better and is used by some programs). The full human genome is 3Gb wide while the full human exome is only 1% of this (~30Mb-50Mb including exon borders).
  • coverage depth is the expression of the number of reads that pile-up in average across the covered sequences

What you order to your provider are millions of reads, what you get after mapping is a coverage depth. Below are (non exhaustive) numbers of reads and coverage values taken from internet pages.

'When used with a 'x' symbol, coverage means here depth of coverage as compared to the full reference Lets compute how much is 1x for the human genome the human genome si about 3Gb wide! (all bases from all assembled chromosomes) we sequence with Illumina paired reads of 100bps we will ideally need 3.10^9 / 100 = 30M juxtaposed reads to cover the full genome 1x with a depth of 1 read

 => 1x = 30M 100bps reads for the human genome

evaluation of coverage requirements for different NGS techniques

  • For DNA-Seq population genetics applications, a full genome average coverage of 4x is considered sufficient if you have multiple genomes in your group.
  • For variant analysis a coverage of 20x and higher will be required to read accurately any given position of the genome and be able to find variants (it is often required to have at least 2 to 5 evidence reads containing a variation to accept it).
  • for DNA.exome-Seq, a coverage of 20x for each replicate is considered good. Exome-seq is just DNA-seq but on a reduced reference genome.
  • for Cancer-Seq, the issue is being able to detect a clone in the tumour mass represented in as low as x% of the whole tumor biopsy (x~5-20%) and that carries specific variants (eg: metastatic drivers!). To reach that goal, sequencing coverage should be increased by a factor 5-to-20 time as compared to a normal diploid genome-Seq (and the genome of a normal tissue of the same cancer patient will also be sequenced for comparison). A coverage depth target of 80x is not exceptional in cancer seq experiments.
  • For human or mouse RNA-Seq, a coverage of 4X or more on exonic regions is enough to compute expression levels (similar to microarray oligo counting). The exome is about 1% of the full genome.
    • more coverage depth will be required if you also want to call variations in exons (see to genome DNA-seq). It is often reported that 50M (million) reads allow a correct RNA-Seq experiment but >100M reads is better for low expression transcripts and difficult to map exons.
    • for RNA-Seq, it is better to pay for 3+ sequencing experiments (replicates) with 40M reads each that for one single experiment with 120M reads because the first will allow computing confidence (p-values) and the second will report only one count and will not allow identifying outliers. Beware, that spitting your 120M reads in three batches of 40M is NOT THE SAME as performing three distinct experiments (search for biological vs technical replicate). The ability to call differential expression for low-expression genes will obviously be limited by the read depth coverage in these transcript region(s).

additional Information from Wout (VIB Nucleomics core[2])

  • Human quantitative mRNAseq: absolute minimum is 6M 50bp reads per sample, but we aim at 15M reads
  • Human quantitative total RNAseq: we aim at 30M reads per sample
  • Human mRNAseq to discover alternative splicing: 20-30M reads (but we haven't done that yet)
  • Human exome sequencing to discover SNPs: 40-50M reads (but we haven't done that yet)
  • Human mRNAseq to discover novel transcripts: 50-60M reads (but we haven't done that yet)
  • Human Chipseq: with 5-10M 50bp reads per sample you may find already most of the peaks, but one should aim at 30-50M reads per sample according to Active Motif. It depends also on the antibody that is used: for some antibodies up to 95% of the reads map outside the peaks. Active Motifs' best antibody (H3K4me3) delivers about 60% of the reads within peaks.
  • Human amplicon sequencing: coverage depends on the minimal % of variant that has to be detected

(thanks Wout for sharing this info)

How much coverage do I need? - recent publications

  • Please read the nice review in Nat. Rev. Genet. link [3]
  • Also look at this paper in Bioinformatics link [4]


How to interpret FASTQC reports

FastQC is a vital diagnostic step that allows identifying large deviations from the normal sequencing situation. A good QC reports is a good thing and you can decide to proceed with analysis. On the other hand, many errors reported in the FASTQC will allow spotting artefacts but are not necessarily a sign of bad data. Always perform QC before spending time on data you should not trust and discuss your findings with experts (the provider or other scientists with experience).

Common sources of errors in NGS data can be found on the QCFAIL [5] web resource.


Tools for analyzing time series RNA-Seq data ?

Assuming it's dependent data (the same individuals are used for all time points):

  • When you have more than 8 time points: impulseDE2 package
  • When you have less than 8 time points: EdgeR or DESeq2: The manual of edgeR suggests to ignore the fact that the measurements are repeated. In DESeq2 you can use a paired design.
  • Perform a trend analysis, adding time as numerical covariate in the model.
  • Perform cluster analysis to look for genes with similar expression patterns.
Read this comparison of traditional tools (DESeq2 and EdgeR) and time-series specific tools.


How to map miRNA-Seq reads ?

miRNA families can be very redundant (e.g. let-7) and the distinct miRNAs are difficult to identify and to validate. So for these it's better to quantify the whole family instead of the individual miRNAs.


Nucleomics Core maps to the genome first using bowtie2 looking for the best alignment:
e.g. bowtie2 -q -p 20 --seed 1 -L 15 —local
Bowtie2 should work as well as bowtie1, since bowtie2 should not return a gapped alignment if a contiguous one exists. However, some people prefer bowtie1 since you can guarantee this by setting the "--best --strata" parameter, which bowtie2 does not do.

The ENCODE project uses STAR to map "small RNA" (~<200b) reads to the genome with the following parameters:

  • --outFilterMismatchNoverLmax 0.05 meaning number of mismatches <= 5% of mapped length, i.e. 0 mismatches for 16-19bp, 1 mismatch for 20-39bp
  • --outFilterMatchNmin 16 meaning reads must match >=16bp to the genome
  • --outFilterScoreMinOverLread 0
  • --outFilterMatchNminOverLread 0
  • --alignIntronMax 1 meaning splicing switched off
  • --clip3pAdapterSeq TGGAATTCTC meaning clip 3'adapter (you can also do this with cutadapt or Trimmomatic before feeding the reads to STAR)
  • --clip3pAdapterMMp 0.1 meaning the proportion of mismatches in the matched adapter length.
  • --alignEndsType EndToEnd meaning it will force end-to-end alignments avoiding soft clipping at the 5'end (see the discussion about "Soft clipping"). In old versions of STAR this parameter is not available but you can use this awk script to filter out all alignments that are trimmed by more than 1 base from the 5':
    awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/); if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam
  • --outFilterMultimapNmax 10 the default setting for this parameter is 10 meaning that reads that map to more than 10 locations are discarded. You can increase this number.

There are better aligners for miRNA data, which also include analysis of the alignment context to give a more accurate picture of expression: see MirDeep2 and ShortStack.


Nucleomics Core gets >90% mapped reads when mapping to the full genome while they only have 50-65% mapped reads on the mirBase database. They performed internal comparisons of different small RNA library prep kits and found that you can still detect a lot of non miRNA species (non coding, coding and ribosomal RNAs). So it's better to map to the genome. However, you will get many multimappers (the miRNA will align to all its targets): more than 40-50% multimappers is not unusual.

How to count miRNA-Seq reads ?

By default, HTSeq-count will ignore/discard multimappers. However, you can force HTSeq-count to quantitate multimappers. It depends on two things:

  • Mapping quality of multimappers (which is usually set by the mapper to a very low value or zero)
  • Any SAM tags indicating a multimapper (the NH tag, seethe documentation). Only TopHat uses this tag, bowtie1 and bowtie2 do not. Even if the tag appears in your bam/sam-file you could strip it:
    e.g. samtools view in.bam | perl -pe 's/\s+NH:i:\d+//' | htseq-count -s no -a 0 -m intersection-nonempty - genes.gtf > counts.txt

  1. http://seqanswers.com/forums/showpost.php?p=25283
  2. 2.0 2.1 http://www.nucleomics.be
  3. David Sims, Ian Sudbery, Nicholas E Ilott, Andreas Heger, Chris P Ponting
    Sequencing depth and coverage: key considerations in genomic analyses.
    Nat. Rev. Genet.: 2014, 15(2);121-32
    [PubMed:24434847] ##WORLDCAT## [DOI] (I p)

  4. Yuwen Liu, Jie Zhou, Kevin P White
    RNA-seq differential expression studies: more sequence or more replication?
    Bioinformatics: 2014, 30(3);301-4
    [PubMed:24319002] ##WORLDCAT## [DOI] (I p)

  5. https://sequencing.qcfail.com/

[ Main_Page | NGS_data_analysis ]