Quality control of NGS data

From BITS wiki
Jump to: navigation, search

[ Main_Page | NGS data analysis | Downloading NGS data from NCBI | Improving the quality of NGS data ]

Before you analyze the data, it is crucial to check the quality of the data.
We will use the standard tool for checking the quality of data generated on the Illumina platform: FASTQC [1]

Correct interpretation of the report that FASTQC generates is very important.
If the quality of your data is good, you can proceed with the analysis.

!! If the quality of your data is very bad, don't immediately throw the data in the recycle bin but contact an expert and ask for his/her opinion. !!

FASTQC opens fastq files to check the quality of the data. A FASTQC report is generated for each fastq file separately.

Install FASTQC. Go to the FASTQC folder and double click the run_fastqc.bat file to start the program on Windows.

Handicon.png Known sources of error in Illumina sequencing are commented on a QCFAIL site developed by the FastQC team [2]

Exercise 1: Quality control of the data of the introduction training

For this exercise we will use a fastq file that can be downloaded from NCBI or from the BITS website.

Step 1: Opening NGS data in FASTQC

FastQC is relatively self explanatory. Help can be found in the manual.

During loading of the file, the software keeps you informed about the progress that is being made.
Once the file is opened, the software automatically analyses the data:


FASTQC2b.png

FASTQC consists of multiple modules each checking a specific aspect of the quality of the data. In the left menu you can select the module you wish to view. By default it shows the results of the Basic statistics module where you can see the number and the length of the reads...

The names of the modules are preceded by an icon that reflects the quality of the data. The icon indicates whether the results of the module seem entirely normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross).

These evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments are expected to produce libraries which are biased in particular ways. You should treat the icons as pointers to where you should concentrate your attention on and understand why your library may not look normal.

The basic statistics module shows that all sequence reads are 36 bases long.


Step 2: Check sequence quality per position

Phred scores represent base call quality. The higher the score the more reliable the base call. Often the quality of reads degrades over the length of the read. Therefore, it is common practice to determine the average quality of the first, second, third,...nth base of the reads by plotting the distribution of the Phred scores on each position of the reads using box plots.

Remark: In new Illumina kits the sequence quality goes up a bit first before it steadily declines.

Step 3: Check overall sequence quality

Instead of showing the quality of each position separately, you can calculate the average Phred score of each read and show a cumulative plot of the average qualities of all the reads.


Step 4: Check quality per tile

Illumina flowcells are divided into tiles. To see if there is a loss in quality associated with specific parts of the flowcell, FASTQC calculates average quality scores for each tile across all positions in the reads.

Reasons for seeing warnings or failures on this plot could be transient problems such as bubbles going through the flowcell, or they could be more permanent problems such as smudges or debris on/in the flowcell or a very high density of clusters in a tile. The most common cause of warnings in this module is the flowcell being overloaded.

It is recommended to ignore warnings/failures which mildly affected a small number of tiles for only 1 or 2 cycles, and to only pursue larger effects which showed high deviation in scores, or which persisted for several cycles.


Step 5: Check duplication levels

In a diverse library generated by shearing, most fragments will occur only once. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication indicates enrichment bias (eg PCR overamplification, contamination of the library with adapter dimers...).

The Sequence duplication levels module counts the degree of duplication for every read and creates a plot showing the relative number of reads with different degrees of duplication.


Step 6: Check per base sequence content

Since the reads are random fragments from the genome sequence, the contribution of A, C, G and T should be identical on each position.

Where do the overrepresented sequences for which no hit is found come from ?
In most cases they are adapter sequences that contain sequencing errors.

You can remove the adapter contamination by trimming adapters. There's is a lot of debate on whether it is required to do this. Some people do this rigorously, while others skip this step. If you do not remove them, reads that are contaminated with adapter sequences will simply not be mapped but their presence will slow down the mapping. So when FASTQC shows a substantial percentage of adapters (as is the case here: almost 20%) it is a good idea to remove them.

Mortasecca.png Warning: Not removing adapter contamination will affect the percentage of mapped reads during the mapping

The Overrepresented sequences module shows contamination with adapter dimers(reads that completely correspond to adapter sequences). Often you also have remnants of adapter sequences at the 3'ends of reads that come from fragments that are smaller than the read length. This form of adapter contamination is not detected by the Overrepresented sequences module.


Exercise 2: Quality control of the data of the ChIPSeq training

Exercise created by Morgane Thomas-Chollier

Use FASTQC to get basic information (read length, number of reads, global quality) of the SRR576933 ChIP and the SRR576938 control datasets.


Quality control of the ChIP dataset

Generate and view the FASTQC report of SRR576933.fastq

Knowing that it is recommended for ChIPSeq to have around 30 million reads, the number of reads in this fastq file seems very low.

Again, you see that the data set consists of very short reads although this data set is very recent. This is because it has been shown that elongating the reads does not improve your results in ChIP-Seq analysis. It will just cost you more money.

Again you see that adapter contamination is a frequently occurring problem of Illumina NGS data.


Quality control of the control dataset

Now do the same for the control data set: SRR576938.fastq.

In theory one expects that regions with high read count in the ChIP sample represent the regions that were enriched by the immunoprecipitation, i.e. the regions that were bound to the protein. However many studies have shown that the read count is affected by many factors, including GC content, mappability, chromatin structure, copy number variations... To account for these biases, a control sample is used consisting of fragmented genomic DNA that was not subjected to immunoprecipitation or that was precipitated using a non-specific antibody.

The ChIP and control samples are usually sequenced at different depths, generating files with different total number of reads. This means that these two samples have to be made comparable later on in the analysis by normalization (see ChIPSeq training).


Estimation of coverage

Knowing your organism size is important to evaluate if your data set has sufficient coverage to continue your analyses, e.g. for the human genome (3 Gb), 10 million reads are considered sufficient.

The FASTQC report has shown that the fastq files of the ChIP and control sample contain 3.6 and 6.7 million reads respectively. As you aim for 10 million reads for 3 Gb in human, we can assume that these data sets contain enough reads for proper analysis.

Exercise 3: Quality control of the paired end data of the RNASeq training

Also for paired end data you check the quality of each file separately, even if the files come from the same sample.

Use FASTQC to check the quality of the SRR1039509_1 and the SRR1039509_2 RNASeq datasets.

The quality of both files is quite high, the reports have a very low number of warnings and failures. However, there still are a few issues in these data sets. The issues are similar in both files, that makes sense since the reads in the files originate from the same sample and the same clusters on the flow cell.

This is not an issue in the first data set.


Exercise 4: Improving the quality of the data set of the introduction training in Galaxy

tutorial on quality control in Galaxy

Go to Galaxy and log on to the server.

Import data

Don’t upload the data, it takes too long. Instead import this shared history. In the right top corner click Switch to this history:


Trimmomatic15.png

Groom the file

Try to run Trimmomatic on the fastq file (use default parameters for the time being).

Groom the data.

Clean the data

Use Trimmomatic to clean the groomed FASTQ file.

Clip the most overrepresented adapter

According to the FASTQC report adapter dimers make up a fifth of the FASTQ file so it is worthwhile to remove them.

Set the accuracy of the match to 8. High accuracy thresholds for short reads will remove adapter dimers but adapter contamination at the 3'end of the reads will remain undetected, this is why we lower the accuracy to 8. A threshold of 8 corresponds to 12 perfect matches between read and adapter, so adapter contamination of 12bp and longer will be detected.

Create a fasta file with the adapter sequence that you want to clip

The minimum length of the reads after trimming should be 20. If a read is too small after trimming it will map everywhere so small reads should be removed.

Remove low quality reads

Remove reads with an average quality below 20.

  • stdout contains the feedback message that was generated by the tool. Some tools generate them to provide details of their actions and their results
  • stderr contains error messages generated by the tool

Check quality of clean reads

Check with FASTQC in Galaxy if the quality of the file was effectively improved by the Trimmomatic step (default settings should work fine).



Exercise 5: Improving the quality of the data sets of the ChIPSeq training in Galaxy

tutorial on quality control in Galaxy

Go to Galaxy and log on to the server.

Import data

Don’t upload the data, it takes too long. Instead import this shared history.

Groom the files

This time we will process both files simultaneously in batch mode using the Multiple datasets button:


Trimmomatic17.png


Clean the reads

The ChIP file contains substantial adapter dimer contamination and both files have one tile that consistently produced low quality scores during the entire sequencing process. Since DNA mappers cannot perform soft clipping we have to clean the ends of the reads. Therefore we will clip adapters in both files since we have no idea about the number of reads that partly consist of adapter sequences. We will use Trimmomatic in batch mode to clean the reads.

Clip adapters

Again, we have short reads so we lower the accuracy of the match to 8.
The minimum length of the reads after trimming should be 20.

Remove low quality reads

Remove reads with an average quality below 20.

Check the stdout report for the control sample and answer the following questions:

Check quality of clean reads

Check with FASTQC in Galaxy if the quality of the ChIP file was effectively improved by the Trimmomatic step (default settings should work fine).




References:
  1. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  2. https://sequencing.qcfail.com/