Quality control of NGS data


Before you analyze the data, it is crucial to check the quality of the data.
We will use the standard tool for checking the quality of data generated on the Illumina platform: FASTQC [1]

Correct interpretation of the report that FASTQC generates is very important.
If the quality of your data is good, you can proceed with the analysis.

!! If the quality of your data is very bad, don't immediately throw the data in the recycle bin but contact an expert and ask for his/her opinion. !!

FASTQC opens FASTQ files to check the quality of the data.

FASTQC has been installed on the BITS laptops. Go to the desktop and double click the fastqc icon to start the program.

Known sources of error in Illumina sequencing are discussed on the QCFAIL site developed by the FastQC team [2]

Exercise 1: Quality control of the data of the introduction training

For this exercise we will use the fastq file that we downloaded from NCBI. You can find it on the BITS laptops in the /Documents/NGSdata folder as SRR074262.fastq. For people who are not using a BITS laptop, I brought all files on a USB stick.

Step 1: Opening NGS data in FASTQC

FastQC is relatively self-explanatory. Help can be found in the manual.

During loading of the file, the software keeps you informed about the progress that is being made.
Once the file is opened, the software automatically analyses the data:


FASTQC consists of multiple modules each checking a specific aspect of the quality of the data. In the left menu you can select the module you wish to view. By default it shows the results of the Basic statistics module where you can see the number and the length of the reads...

The names of the modules are preceded by an icon that reflects the quality of the data. The icon indicates whether the results of the module seem entirely normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross).

These evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments are expected to produce libraries which are biased in particular ways. You should treat the icons as pointers to where you should concentrate your attention, and try to understand why your library may not look normal.

The basic statistics module shows that all sequence reads are 36 bases long.

Step 2: Check sequence quality per position

Phred scores represent base call quality. The higher the score the more reliable the base call. Often the quality of reads degrades over the length of the read. Therefore, it is common practice to determine the average quality of the first, second, third,...nth base of the reads by plotting the distribution of the Phred scores on each position of the reads using box plots.
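To make the relation between Phred scores and reliability concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means 1 wrong base call in 100 and Q30 means 1 in 1000. The sketch below decodes quality strings and averages the scores per position. It assumes the Phred+33 encoding (older Illumina files may use Phred+64), the function names are made up for this illustration, and it only computes the mean per position, whereas FASTQC draws a full box plot.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into Phred scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at the 1st, 2nd, ... nth base over all reads."""
    totals, counts = [], []
    for quality_string in quality_strings:
        for i, q in enumerate(decode_qualities(quality_string, offset)):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads of 4 bases: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
example = ["IIII", "II5!"]
print(mean_quality_per_position(example))  # [40.0, 40.0, 30.0, 20.0]
```

The toy example shows the typical pattern: the mean quality drops towards the end of the reads.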

Remark: In new Illumina kits the sequence quality goes up a bit first before it steadily declines.

Step 3: Check overall sequence quality

Instead of showing the quality of each position separately, you can calculate the average Phred score of each read and plot the distribution of these average qualities over all the reads.
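The per-read average can be sketched in a few lines. This is an illustration only: it assumes Phred+33 quality strings, the function names are hypothetical, and the averages are rounded down to whole Phred scores to build a simple histogram.

```python
from collections import Counter


def mean_read_quality(quality_string, offset=33):
    """Average Phred score of a single read (assumes Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def read_quality_histogram(quality_strings, offset=33):
    """Number of reads per average quality, rounded down to a whole score."""
    return Counter(int(mean_read_quality(q, offset)) for q in quality_strings)


# Three toy reads: two of uniformly high quality, one that degrades.
histogram = read_quality_histogram(["IIII", "II5!", "IIII"])
for quality, n_reads in sorted(histogram.items()):
    print(quality, n_reads)
```

In a good data set almost all reads end up in the high-quality bins; a second peak at low quality points to a subset of poor reads.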

Step 4: Check quality per tile

Illumina flowcells are divided into tiles. To see if there is a loss in quality associated with specific parts of the flowcell, FASTQC calculates average quality scores for each tile across all positions in the reads.

Reasons for seeing warnings or failures on this plot could be transient problems such as bubbles going through the flowcell, or they could be more permanent problems such as smudges or debris on/in the flowcell or a very high density of clusters in a tile. The most common cause of warnings in this module is the flowcell being overloaded.

It is recommended to ignore warnings/failures which mildly affected a small number of tiles for only 1 or 2 cycles, and to only pursue larger effects which showed high deviation in scores, or which persisted for several cycles.

Step 5: Check duplication levels

In a diverse library generated by shearing, most fragments will occur only once. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication indicates enrichment bias (e.g. PCR overamplification, contamination of the library with adapter dimers...).

The Sequence duplication levels module counts the degree of duplication for every read and creates a plot showing the relative number of reads with different degrees of duplication.
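The idea behind this count can be sketched as follows. This is a simplified illustration, not FASTQC's actual procedure: FASTQC works on a sample of the file and bins the higher duplication levels, while this sketch counts every read exactly. The function name is hypothetical.

```python
from collections import Counter


def duplication_levels(sequences):
    """Percentage of reads whose sequence occurs 1, 2, 3, ... times."""
    copies_per_sequence = Counter(sequences)
    reads_per_level = Counter()
    for n_copies in copies_per_sequence.values():
        # A sequence seen n times contributes n reads to duplication level n.
        reads_per_level[n_copies] += n_copies
    total = len(sequences)
    return {level: 100 * reads / total
            for level, reads in sorted(reads_per_level.items())}


# Toy data set of 8 reads: 3 unique reads, one sequence seen twice,
# one sequence seen three times.
reads = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG", "TTT", "ACG"]
print(duplication_levels(reads))  # {1: 37.5, 2: 25.0, 3: 37.5}
```

In a diverse library nearly all reads sit at level 1; a large share of reads at high levels is the signal to look for.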

Step 6: Check per base sequence content

Since the reads are random fragments from the genome sequence, the proportion of A, C, G and T should be roughly the same at each position of the reads, so the four lines in the plot should run parallel to each other.
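What this module computes can be sketched as a per-position tally of the bases. The function name is hypothetical and the reads are toy data; bases other than A, C, G and T (such as N) count towards the total of a position but are not reported.

```python
from collections import Counter


def base_content_per_position(sequences):
    """Percentage of A, C, G and T at each position of the reads."""
    counters = []
    for sequence in sequences:
        for i, base in enumerate(sequence.upper()):
            if i == len(counters):  # first read that reaches this position
                counters.append(Counter())
            counters[i][base] += 1
    content = []
    for counter in counters:
        total = sum(counter.values())
        content.append({base: 100 * counter[base] / total for base in "ACGT"})
    return content


reads = ["ACGT", "AGGT", "ACTT", "ATGA"]
for position, percentages in enumerate(base_content_per_position(reads), start=1):
    print(position, percentages)
```

In this toy set every read starts with A, which is the kind of positional bias the module flags in real data.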

Where do the overrepresented sequences for which no hit is found come from ?
In most cases they are adapter sequences that contain sequencing errors.

You can remove the adapter contamination by trimming adapters. There is a lot of debate on whether it is required to do this. Some people do this rigorously, while others skip this step. If you do not remove them, reads that are contaminated with adapter sequences will simply not be mapped but their presence will slow down the mapping. So when FASTQC shows a substantial percentage of adapters (as is the case here: almost 20%) it is a good idea to remove them.

Warning: Not removing adapter contamination will affect the percentage of mapped reads during the mapping.

The Overrepresented sequences module shows contamination with adapter dimers (reads that completely correspond to adapter sequences). Often you also have remnants of adapter sequences at the 3' ends of reads that come from fragments that are smaller than the read length. This form of adapter contamination is not detected by the Overrepresented sequences module.
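To illustrate what an adapter remnant at the 3' end looks like, here is a minimal sketch that removes either a complete adapter or a partial one hanging off the end of a read. It only accepts exact matches, whereas dedicated trimming tools also tolerate sequencing errors. The function name, the min_overlap parameter and the adapter sequence in the example are assumptions for illustration; use the adapter that belongs to your own library preparation kit.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (or the start of one) from the 3' end of a read.

    Exact matching only. min_overlap is the shortest adapter prefix at the
    very end of the read that is still treated as contamination.
    """
    # Case 1: the complete adapter occurs somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends with the first k bases of the adapter.
    longest = min(len(adapter), len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Example adapter sequence, assumed for illustration only.
ADAPTER = "AGATCGGAAGAGC"

print(trim_3prime_adapter("ACGTAGATCGGAAGAGCTT", ADAPTER))  # ACGT
print(trim_3prime_adapter("ACGTACGTAGATCGG", ADAPTER))      # ACGTACGT
print(trim_3prime_adapter("ACGTACGTACGT", ADAPTER))         # ACGTACGTACGT
```

The second example is the case described above: the fragment was shorter than the read, so only the first bases of the adapter made it into the read.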

For the analysis we'll work in GenePattern. It provides easy access to tools for different kinds of analyses (e.g. RNASeq, variant and ChIPSeq analysis) via a web browser.

Exercise 2: Quality control of the data of the ChIP-Seq training

Exercise created by Morgane Thomas-Chollier

We will use FASTQC inside GenePattern to get some basic information on the data (read length, number of reads, global quality of datasets).

Read the GenePattern tutorial for more details on how to use GenePattern.
The data is already present on the GenePattern server. When you open a tool in GenePattern, you will find the Add Paths or URLs button in the input files section:


Click the button and expand BITS trainingdata Chipseq:


The fastq file of the control data set is also available in the shared data folder (SRR576938.2.fastq)

Again, you see that the data set consists of very short reads although this data set is very recent. This is because it has been shown that elongating the reads does not improve your results in ChIP-Seq analysis. It will just cost you more money.

Again you see that adapter contamination is a frequently occurring problem of Illumina NGS data.

Now do the same for the control data set: SRR576938.2.fastq.

In theory one expects that regions with high read count in the ChIP sample represent the regions that were enriched by the immunoprecipitation, i.e. the regions that were bound to the protein. However many studies have shown that the read count is affected by many factors, including GC content, mappability, chromatin structure, copy number variations... To account for these biases, a control sample is used consisting of fragmented genomic DNA that was not subjected to immunoprecipitation or that was precipitated using a non-specific antibody.

The ChIP and control samples are usually sequenced at different depths, generating files with different total number of reads. This means that these two samples have to be made comparable later on in the analysis by normalization (see ChIP-Seq training).

Estimation of coverage

Knowing the genome size of your organism is important to evaluate if your data set has sufficient coverage to continue your analyses, e.g. for the human genome (3 Gb), 10 million reads are considered sufficient.

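The FASTQC report has shown that the fastq files of the ChIP and control sample contain 3.6 and 6.7 million reads respectively. The guideline of 10 million reads applies to the 3 Gb human genome, and the number of reads you need scales with the genome size. For a genome that is much smaller than the human genome, we can therefore assume that these data sets contain enough reads for proper analysis.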
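The arithmetic behind such an estimate is simple: the average coverage is the number of reads times the read length divided by the genome size. The sketch below is a rough illustration; the function names are hypothetical, the read length of 36 bases and the 5 Mb genome are assumed example values, and the scaling treats the 10 million reads per 3 Gb guideline as strictly proportional to genome size.

```python
def estimated_coverage(n_reads, read_length, genome_size):
    """Average number of times each base of the genome is covered by a read."""
    return n_reads * read_length / genome_size


def reads_for_same_depth(genome_size,
                         reference_reads=10_000_000,
                         reference_genome_size=3_000_000_000):
    """Scale the '10 million reads for 3 Gb' guideline to another genome size."""
    return reference_reads * genome_size / reference_genome_size


# Guideline for human, assuming reads of 36 bases (assumed read length).
print(estimated_coverage(10_000_000, 36, 3_000_000_000))

# Reads needed for the same depth on a hypothetical 5 Mb genome.
print(round(reads_for_same_depth(5_000_000)))
```

Comparing the number of reads in your own fastq files with the scaled guideline tells you whether the data set is deep enough for the genome you are working on.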

  1. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  2. https://sequencing.qcfail.com/