Improving the quality of NGS data

From BITS wiki
Jump to: navigation, search

[ Main_Page | NGS data analysis | Quality control of NGS data | Mapping of NGS data ]

Group exercises

You can solve most quality issues found by FASTQC e.g. trimming contaminating adapters, low quality bases at the end of your reads, filtering low quality reads....

There's is a lot of debate on whether it is required to do this. Reads that are contaminated with adapter sequences will not map but if these reads make up a large fraction of the total number of reads they might slow down the mapping a lot. While it is true that mappers can use noisy info (still containing adapters, low quality bases...), the mapping results will be negatively affected by this noise.

Cleaning is in my opinion worthwhile especially when working with small reads and in case of extensive adapter contamination (almost always).

Quality control in Galaxy

Links:

Galaxy is a bioinformatics server that contains many tools, data and analysis results. Before you can upload your data to Galaxy, you have to register or log in to Galaxy (see slides).

Upload data to Galaxy

If you want to work on your data in Galaxy, you have to first get the data into Galaxy. To accomplish this you can use the Upload file tool in the Get data section.

Instead I shared the file on Galaxy so you can import it using this link. Make sure that you are logged on to Galaxy before you do this. When you click this link you are redirected to a web page where you can import the file:

Galaxy6a.png


The history

Data sets that are uploaded or created by running a tool appear in the history in the right Galaxy pane.

To give a history a new name, click the history's current name, type a new one and hit enter.

Clicking the name of a data set unfolds a preview, a short description and tools to manipulate the data.

Icons in the History

  • Clicking the floppy (Download) icon will download the file to your computer
  • To visualize a file in the middle pane, click the eye (View data) icon next to the name of the file.

Colors of files in the History

Data sets in the history have different colors representing different states.

  • Grey: The job is placed in the waiting queue. You can check the status of queued jobs by refreshing the History pane.
  • Yellow: The job is running.
  • Green: When the job has been run the status will change from yellow to green if completed successfully.
  • Red: When the job has been run the status will change from yellow to red if problems were encountered.

Running Groomer in Galaxy

If you select a tool in Galaxy it will automatically detect all data sets in your history that it can use as input. In the case shown below the tool does not recognize the fastq file in the history:

Galaxy10b.png

The fact that the tool does not recognize the fastq file means that the fastq file is so messy that the tool can't read it. Remember that there is a tool to clean messy fastq files: FASTQ Groomer

Check the quality encoding in your fastq file (e.g. in FASTQC), and click the Execute button to start the tool:

Galaxy12a.png

Grooming takes long (30 min when Galaxy traffic is low). You can choose to wait but if it takes too long you can click the Delete button in the History (see slides) to stop the tool. I have provided the groomed file: import it in Galaxy using this link.


Using Trimmomatic in Galaxy

To clean your data use the Trimmomatic tool in the Quality Control section of tools. Click the name of the tool to display its parameters in the middle pane.

See this page for an overview of the Trimmomatic parameters.

A bit more explanation:

  • The input file with the reads: Galaxy will automatically suggest a file from your History that has the right format, in this case: a fastq file. If Galaxy doesn't make a suggestion it means it cannot find any files in your History with the right format.
  • The sequence of the adapter: provide a custom sequence. If you analyze your own data you know which adapter sequences were used. Since this is public data we don't really know the name of the adapter. However, remember that FASTQC gives you a list of contaminating adapter sequences so you have the sequence of the adapter. Choose custom adapter sequence and paste the adapter sequence from FASTQC. You can only enter one sequence.

Click Execute to run the tool.

In the history you see a new item, colored in yellow as long as the tool is running. Regularly hit the Refresh button in the History to check if the tool has finished. Clipping should go fast, after a few minutes you should have the result.


Running FASTQC in Galaxy

Search for FASTQC in the tools pane and click the resulting FastQC link to open the parameter settings in the middle pane:

Galaxy18b.png

FASTQC automatically recognizes all files it can use as an input. Select the file you want to use.
The FASTQC implementation in Galaxy can take an optional file containing a list of contaminants. If you don't specify one, FASTQC will look for standardly used Illumina adapters.

In most cases you keep the default settings and click Execute.


Quality control in GenePattern

Genepattern is very similar to Galaxy. It's as user-friendly as Galaxy, allows analysis of NGS data just like Galaxy...

It provides easy access to hundreds of tools for different kinds of analyses (e.g. RNA-seq, microarray, proteomics and flow cytometry, sequence variation, copy number and network analysis) via a web browser.

Links

Consult the GenePattern tutorial for more info.

Running Groomer in GenePattern

The Broad Genepattern server does not contain the Groomer tool, but we have added the tool to our BITS Genepattern server.

  • Search for the Groomer tool in GenePattern.
  • Define the parameters: one of the parameters you need to define is Input format: the encoding of the fastq file you want to clean. The encoding is important because it determines the offset of the quality scores (ASCII offset 33 or ASCII offset 64). If you're not sure you can check the encoding of your file in the FastQC report (take into account that FastQC sporadically makes the wrong guess).
    GP9.png
  • Run the Groomer tool.

Running FastQC in GenePattern

  • Search for the FASTQC tool
  • Fill in the parameters
  • Run the FASTQC tool

You can open the resulting HTML report in your browser:

  • Click the name of the output file at the bottom of the page
  • Select Open Link
    GP18.png

Running Trimmomatic in GenePattern

In GenePattern you can improve the quality of your NGS data using the Trimmomatic tool.

  • Search for the Trimmomatic tool
  • Fill in the parameters: See this page for an overview of the Trimmomatic parameters.
  • Run Trimmomatic

Removing adapters using command line tools

See exercise on using cutadapt to trim adapter sequences


References: