Quality control of NGS data

[ Main_Page | NGS data analysis | Downloading NGS data from NCBI | Improving the quality of NGS data ]

Before you analyze the data, it is crucial to check the quality of the data.
We will use the standard tool for checking the quality of data generated on the Illumina platform: FASTQC ^[1]

Correct interpretation of the report that FASTQC generates is very important.
If the quality of your data is good, you can proceed with the analysis.

!! If the quality of your data is very bad, don't immediately throw the data in the recycle bin but contact an expert and ask for his/her opinion. !!

FASTQC opens fastq files to check the quality of the data. A FASTQC report is generated for each fastq file separately.

Install FASTQC. Go to the FASTQC folder and double click the run_fastqc.bat file to start the program on Windows.

Known sources of error in Illumina sequencing are commented on a QCFAIL site developed by the FastQC team ^[2]

1 Exercise 1: Quality control of the data of the introduction training
2 Exercise 2: Quality control of the data of the ChIPSeq training
3 Exercise 3: Quality control of the paired end data of the RNASeq training
4 Exercise 4: Improving the quality of the data set of the introduction training in Galaxy
5 Exercise 5: Improving the quality of the data sets of the ChIPSeq training in Galaxy

Exercise 1: Quality control of the data of the introduction training

For this exercise we will use a fastq file that can be downloaded from NCBI or from the BITS website.

Step 1: Opening NGS data in FASTQC

FastQC is relatively self explanatory. Help can be found in the manual.

Open the file that we have downloaded from NCBI in FASTQC
Go to the top menu, select File and click Open.

During loading of the file, the software keeps you informed about the progress that is being made.
Once the file is opened, the software automatically analyses the data:

FASTQC consists of multiple modules each checking a specific aspect of the quality of the data. In the left menu you can select the module you wish to view. By default it shows the results of the Basic statistics module where you can see the number and the length of the reads...

The names of the modules are preceded by an icon that reflects the quality of the data. The icon indicates whether the results of the module seem entirely normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross).

These evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments are expected to produce libraries which are biased in particular ways. You should treat the icons as pointers to where you should concentrate your attention on and understand why your library may not look normal.

The basic statistics module shows that all sequence reads are 36 bases long.

Step 2: Check sequence quality per position

Phred scores represent base call quality. The higher the score the more reliable the base call. Often the quality of reads degrades over the length of the read. Therefore, it is common practice to determine the average quality of the first, second, third,...nth base of the reads by plotting the distribution of the Phred scores on each position of the reads using box plots.

Evaluate the quality scores per position
In the left menu, click Per base sequence quality: The y-axis on the graph shows the Phred quality scores, the x-axis shows the position in the read. So again you can see that the sequence reads in this file are 36 bases long. The average Phred score is depicted by the blue line, the median Phred score by the red line. The yellow boxes contain 50% of all Phred scores on a certain position. As expected the quality is steadily declining. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red; Phred score < 20). As you can see the Phred scores of this data set are OK.

Remark: In new Illumina kits the sequence quality goes up a bit first before it steadily declines.

Step 3: Check overall sequence quality

Instead of showing the quality of each position separately, you can calculate the average Phred score of each read and show a cumulative plot of the average qualities of all the reads.

Evaluate the overall quality
In the left menu, click Per sequence quality scores: The y-axis on the graph shows the number of reads, the x-axis shows the Phred score. Most reads have an average Phred score of 32. This seems a very high score (Phred scores of Illumina calls range from -5 to 41). However, you also have to take into account the length of the reads: the quality decreases over the length, so the longer the reads the smaller the average read quality. Nevertheless, the average Phred scores of the data are fine.

Step 4: Check quality per tile

Illumina flowcells are divided into tiles. To see if there is a loss in quality associated with specific parts of the flowcell, FASTQC calculates average quality scores for each tile across all positions in the reads.

Evaluate the quality per tile
In the left menu, click Per tile sequence quality: The y-axis on the graph shows the tile number, the x-axis shows the position in the reads. The plot shows the deviation from the average quality for each tile. The colours are on a cold to hot scale, with blue being the average quality and other colours representing positions where the quality was different from the average. In the example data you can see that one tile shows poor quality over two positions. A good plot should be blue all over. Although the plot isn't entirely blue the results of this module are still acceptable.

Reasons for seeing warnings or failures on this plot could be transient problems such as bubbles going through the flowcell, or they could be more permanent problems such as smudges or debris on/in the flowcell or a very high density of clusters in a tile. The most common cause of warnings in this module is the flowcell being overloaded.

It is recommended to ignore warnings/failures which mildly affected a small number of tiles for only 1 or 2 cycles, and to only pursue larger effects which showed high deviation in scores, or which persisted for several cycles.

Step 5: Check duplication levels

In a diverse library generated by shearing, most fragments will occur only once. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication indicates enrichment bias (eg PCR overamplification, contamination of the library with adapter dimers...).

The Sequence duplication levels module counts the degree of duplication for every read and creates a plot showing the relative number of reads with different degrees of duplication.

Evaluate the sequence duplication levels
In the left menu, click Sequence duplication levels: The y-axis on the graph shows the percentage of occurrence, the x-axis shows the duplication level. The blue line represents the counts of all the sequences that are duplicated. The percentage is computed relative to the total number of reads. The red line represents the number of distinct sequences that are duplicated. The percentage is computed relative to the total number of distinct sequences in the data (see slides). In a diverse library most sequences fall into the far left of the plot. Oversequencing of a small library will flatten the lines, lowering the low end and raising other categories. The presence of contaminants will produce spikes on the right of the plot. These peaks will appear in the blue trace as they make up a high proportion of the original library, but usually disappear in the red trace as they make up a small proportion of the deduplicated set. If peaks persist in the red trace then this suggests that there are a large number of different highly duplicated sequences which might indicate either a contaminant set or a very severe technical duplication. In our file, you see that there are few sequences with very high duplication levels (typical for adapter contamination), but there is also a substantial number of sequences with lower duplication levels (< 50).

Step 6: Check per base sequence content

Since the reads are random fragments from the genome sequence, the contribution of A, C, G and T should be identical on each position.

Evaluate the per base sequence content
In the left menu, click Per base sequence content: The y-axis on the graph shows the percentage of occurrence, the x-axis shows the position in the read. On this plot you should see straight lines for the four nucleotides. In reality you often see that this is not the case for the first positions. Libraries produced by priming using random hexamers (nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the first positions of the reads. This bias does not come from a single sequence, but because of enrichement of a number of different K-mers at the 5' end of the reads. So it isn't something you can correct by trimming (you do not have a specific sequence that you can trim of the reads). In most cases it doesn't seem to adversely affect the downstream analysis but it will produce a warning or failure in this module. In our file, however, you see that the plot is completely off: there is considerable bias in each position, giving a failure for this plot.

Where does the observed bias come from ?
In the left menu, click Overrepresented sequences: Failures in the Per base sequence content plot are often related to contamination of your library. If one specific read is making up a substantial fraction of your library, the sequence of that read will distort the plot (the percentage of bases that you see in each position will be greatly influenced by the sequence of the read). You can check for contaminating sequences using the Overrepresented sequences module: it lists all sequences which make up more than 0.1% of the total. For each sequence in the list the module will look for matches in a database of common contaminants and will report the best hit it finds. As you can see the module finds many contaminants, some of them could be identified as Illumina adapters.

Where do the overrepresented sequences for which no hit is found come from ?
In most cases they are adapter sequences that contain sequencing errors.

You can remove the adapter contamination by trimming adapters. There's is a lot of debate on whether it is required to do this. Some people do this rigorously, while others skip this step. If you do not remove them, reads that are contaminated with adapter sequences will simply not be mapped but their presence will slow down the mapping. So when FASTQC shows a substantial percentage of adapters (as is the case here: almost 20%) it is a good idea to remove them.

Warning: Not removing adapter contamination will affect the percentage of mapped reads during the mapping

The Overrepresented sequences module shows contamination with adapter dimers(reads that completely correspond to adapter sequences). Often you also have remnants of adapter sequences at the 3'ends of reads that come from fragments that are smaller than the read length. This form of adapter contamination is not detected by the Overrepresented sequences module.

Exercise 2: Quality control of the data of the ChIPSeq training

Exercise created by Morgane Thomas-Chollier

Use FASTQC to get basic information (read length, number of reads, global quality) of the SRR576933 ChIP and the SRR576938 control datasets.

Quality control of the ChIP dataset

Generate and view the FASTQC report of SRR576933.fastq

How many reads does the file contain ?
This is one of the results of the Basic statistics module in FASTQC (red):

Knowing that it is recommended for ChIPSeq to have around 30 million reads, the number of reads in this fastq file seems very low.

Should we be concerned about the low number of reads in the sample ?
No it's not a problem because the sample comes from E. coli. This bacterium has a very small genome so 3 million reads will still generate high coverage. However, if this was a human or mouse sample the number of reads would be way too low and we would indeed be concerned.

What is the length of the reads ?
This is one of the results of the Basic statistics module in FASTQC (green):

Again, you see that the data set consists of very short reads although this data set is very recent. This is because it has been shown that elongating the reads does not improve your results in ChIP-Seq analysis. It will just cost you more money.

Are there any positions with low sequence quality ?
This is shown in the Per base sequence quality module in FASTQC: The overall sequence quality is good, although it drops sharply at the last position, but this is normal in Illumina data, so this feature is not raising hard concerns.

What could be the cause of the failure in the per base sequence content plot ?
The content of the 4 nucleotides is far from constant over all positions: This typically point the presence of adapter or other contaminating sequences in your reads.

Which FASTQC module allows you to confirm this suspicion ?
The Overrepresented sequences module will show if your read file is enriched in known contaminants.

What does this module tell you ?
The Overrepresented sequences module shows a high percentage of adapter sequencess (29% !).

Again you see that adapter contamination is a frequently occurring problem of Illumina NGS data.

What about sequence duplication levels ?
There is sequence duplication. Adapter contamination will be partly responsible for the high duplication levels (the blue peaks at the far right of the plot) but the main cause lies in the technique itself. Typically, after ChIP, you end up with a very small initial amount of DNA (antibodies are not that effective, many cleanup steps in the protocol,...) and you have to do PCR to get your library up to a proper size for sequencing. So naturally, you expect many clones of the same DNA fragment due to the small initial size of the library.

Quality control of the control dataset

Now do the same for the control data set: SRR576938.fastq.

In theory one expects that regions with high read count in the ChIP sample represent the regions that were enriched by the immunoprecipitation, i.e. the regions that were bound to the protein. However many studies have shown that the read count is affected by many factors, including GC content, mappability, chromatin structure, copy number variations... To account for these biases, a control sample is used consisting of fragmented genomic DNA that was not subjected to immunoprecipitation or that was precipitated using a non-specific antibody.

How many reads are there in the control data set ?
This is one of the results of the Basic statistics module in FASTQC. You see that the control data set contains the double amount of reads as the ChIP data set.

The ChIP and control samples are usually sequenced at different depths, generating files with different total number of reads. This means that these two samples have to be made comparable later on in the analysis by normalization (see ChIPSeq training).

What is the length of the reads in the control data set ?
This is one of the results of the Basic statistics module in FASTQC. You see that the control data set contains reads of 36 bases just like the ChIP data set.

Are there any positions with low sequence quality ?
This is shown in the Per base sequence quality module in FASTQC: The overall sequence quality is good, although it drops sharply at the last position, but this is normal in Illumina data, so this feature is not raising hard concerns.

Why did the per base sequence quality plot raise a failure in the ChIP sample and not in the control ?
In the slides you can see that the thresholds for a warning are: end of box < 10 median < 25 On the figure you see that the culprit is the median: In the ChIP sample the median Phred score of the last position is 21 (so below 25) raising a failure In the control sample the median Phred score of the last position is 26 (so above 25)

Which FASTQC module gives a failure ?
The Per tile sequence quality module. The quality of one of the tiles is consistently different from the rest of the tiles.

Is this also the case in the ChIP sample ?
Yes, you see exactly the same problem in the ChIP sample. Since both samples were probably loaded on the same lane, it seems normal that you see the same problem in the ChIP sample.

Why does the Sequence duplication levels modules give a failure in the control sample ?
The duplication levels in the control data set are high. There are a high number of sequences with low duplication levels which could be due to high coverage. Remember that you are working with E. coli which has a small genome.

Estimation of coverage

Knowing your organism size is important to evaluate if your data set has sufficient coverage to continue your analyses, e.g. for the human genome (3 Gb), 10 million reads are considered sufficient.

What is the size of the genome of the E. coli K-12 strain substr. MG1655?
Go to the NCBI website Select the Genome database to search in Type Escherichia coli in the search term box Click Search Expand the Representative section The genome is 4.64 Mbase.

The FASTQC report has shown that the fastq files of the ChIP and control sample contain 3.6 and 6.7 million reads respectively. As you aim for 10 million reads for 3 Gb in human, we can assume that these data sets contain enough reads for proper analysis.

Exercise 3: Quality control of the paired end data of the RNASeq training

Also for paired end data you check the quality of each file separately, even if the files come from the same sample.

Use FASTQC to check the quality of the SRR1039509_1 and the SRR1039509_2 RNASeq datasets.

The quality of both files is quite high, the reports have a very low number of warnings and failures. However, there still are a few issues in these data sets. The issues are similar in both files, that makes sense since the reads in the files originate from the same sample and the same clusters on the flow cell.

Which FASTQC module gives a failure for the second data set?
The Per tile sequence quality module. There are 8 tiles in the second fastq file that have 3 to 4 low quality bases.

This is not an issue in the first data set.

How is this possible?
Although the two files represent the same clusters on the flow cell they were generated by two separate sequencing events, done on different days. In the second sequencing run air bubbles passed through the flow cell while in the first run this did not happen.

Which FASTQC module also reflects this issue?
The Per base N content module. Because of the passing air bubble, the signals emitted by the clusters were not clear on these 8 tiles. If the signal is not clear the sequencer does not know what base to call. If he doesn't know, he will call an N. So the sequencer called N's on the positions that were affected by the passing air bubble.

Should we be worried about the warnings in the Sequence Duplication Levels module?
No, it's RNASeq data: the duplication that is observed was generated by the highly expressed genes. There's no clear peak at the right end of the plot so there is no or low adapter dimer contamination.

Is this conclusion confirmed by the Overrepresented Sequences module?
Yes, there's very limited adapter dimer contamination in these files.

Should we be concerned by the adapter dimer contamination?
No, the amount of adapter dimer contamination is too low to be of any concern.

Exercise 4: Improving the quality of the data set of the introduction training in Galaxy

tutorial on quality control in Galaxy

Go to Galaxy and log on to the server.

Import data

Don’t upload the data, it takes too long. Instead import this shared history. In the right top corner click Switch to this history:

Groom the file

Try to run Trimmomatic on the fastq file (use default parameters for the time being).

Does Trimmomatic recognize the input file?
No, it doesn't automatically appear in the Input FASTQ file box:

Why?
The FASTQ file is messy, as a result Trimmomatic doesn’t recognize it as a potential input file.

How can we fix this?
We can fix this by grooming the file.

Groom the data.

What is the ASCII encoding of the quality scores in the FASTQ file?
You can find the encoding in the Basic statistics module of the FASTQC report.

How many sequences are in the data set?
Click the name of the groomed file in the history to see the details of the file.

Clean the data

Use Trimmomatic to clean the groomed FASTQ file.

Clip the most overrepresented adapter

According to the FASTQC report adapter dimers make up a fifth of the FASTQ file so it is worthwhile to remove them.

Set the accuracy of the match to 8. High accuracy thresholds for short reads will remove adapter dimers but adapter contamination at the 3'end of the reads will remain undetected, this is why we lower the accuracy to 8. A threshold of 8 corresponds to 12 perfect matches between read and adapter, so adapter contamination of 12bp and longer will be detected.

Create a fasta file with the adapter sequence that you want to clip

Where can you find that sequence?
The sequence of the adapter is in the Overrepesented sequences table of the FASTQC report.

The minimum length of the reads after trimming should be 20. If a read is too small after trimming it will map everywhere so small reads should be removed.

Remove low quality reads

Remove reads with an average quality below 20.

How many reads remain after trimming and filtering?
Click the View details icon of the trimmed file in the history: Click the stdout link: In the stdout report look at the Surviving info on the last line.

stdout contains the feedback message that was generated by the tool. Some tools generate them to provide details of their actions and their results
stderr contains error messages generated by the tool

What fraction of reads was removed?
In the stdout report look at the Dropped info on the last line.

Does it correspond to the 20% of adapter contamination that FASTQC reported?
No, 3200689 reads (33.27%) were dropped while FASTQC reported around 20% adapter dimers.

Why?
You not only discard adapter dimers but also reads that become too short after adapter trimming and reads with average quality below 20.

Check quality of clean reads

Check with FASTQC in Galaxy if the quality of the file was effectively improved by the Trimmomatic step (default settings should work fine).

Were adapter dimers effectively removed?
The overrepresented sequences module of FASTQC does no longer show the adapters.

Is the length of the reads still 36 bases?
The length of the reads varies from 20 to 36 bases because reads that consisted partly of adapter sequence have been trimmed. There are no reads < 20 bases because you asked Trimmomatic to discard them.

Why do we still have a large percentage of duplicates even after adapter removal?
High duplication levels are still an issue but that’s normal since this is RNASeq data. Although you did remove the adapter dimers (peak at the right end of the duplication plot) there are still many duplicates that originate from highly expressed genes.