Improving the quality of NGS data
You can solve most quality issues found by FASTQC, e.g. by trimming contaminating adapters, trimming low-quality bases at the ends of your reads, or filtering out low-quality reads.
There is a lot of debate on whether this is required. Reads that are contaminated with adapter sequences will not map, and if these reads make up a large fraction of the total number of reads they might slow down the mapping considerably. While it is true that mappers can handle noisy input (reads still containing adapters, low-quality bases...), the mapping results will be negatively affected by this noise.
In my opinion, cleaning is worthwhile, especially when working with short reads and when adapter contamination is extensive (which is almost always the case).
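To get a feeling for how extensive adapter contamination is, you can count the fraction of reads that contain the adapter. The Python sketch below is only an illustration: the function names and the adapter sequence are placeholders that I made up for this example, and it only finds exact, complete copies of the adapter.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from plain 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # the '+' separator line
        qual = next(it, "").strip()
        yield header, seq, qual


def adapter_fraction(lines: Iterable[str], adapter: str) -> float:
    """Return the fraction of reads that contain the full adapter sequence."""
    total = 0
    hits = 0
    for _, seq, _ in parse_fastq(lines):
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


# Placeholder adapter: replace it with the sequence reported by FASTQC.
ADAPTER = "AGATCGGAAGAGC"

# Two reads in memory; with a real file you would pass open("reads.fastq").
example = [
    "@read1", "ACGTAGATCGGAAGAGC", "+", "IIIIIIIIIIIIIIIII",
    "@read2", "ACGTACGTACGT", "+", "IIIIIIIIIIII",
]
print(adapter_fraction(example, ADAPTER))  # prints 0.5
```

Reads that carry only the first few bases of the adapter at their end are missed by this count, so the real contamination is usually somewhat higher.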
- 1 Quality control in Galaxy
- 2 Quality control in GenePattern
- 2.1 Access GenePattern
- 2.2 The GenePattern user interface
- 2.3 Creating a folder in GenePattern
- 2.4 Searching a tool in GenePattern
- 2.5 Running tools in GenePattern
- 2.6 Store the output of a tool in GenePattern
- 2.7 Running Groomer in GenePattern
- 2.8 Running FastQC in GenePattern
- 2.9 Running Trimmomatic in GenePattern
- 3 Removing adapters using command line tools
Quality control in Galaxy
- European Galaxy.
- Raw Arabidopsis data in European Galaxy
- Groomed Arabidopsis data in European Galaxy
- Clean Arabidopsis data in European Galaxy
- Raw E. coli data in European Galaxy
- Groomed E. coli data in European Galaxy
- Filtered E. coli data in European Galaxy
- Main Galaxy.
- Raw Arabidopsis data in main Galaxy
- Groomed Arabidopsis data in main Galaxy
- our Galaxy tutorial (may be outdated)
Galaxy is a bioinformatics server that contains many tools, data and analysis results. Before you can upload your data to Galaxy, you have to register or log in to Galaxy (see slides).
Upload data to Galaxy
If you want to work on your data in Galaxy, you have to first get the data into Galaxy. To accomplish this you can use the Upload file tool in the Get data section.
Instead of uploading the file yourself, you can import the file that I shared on Galaxy using this link. Make sure that you are logged in to Galaxy before you do this. Clicking the link redirects you to a web page where you can import the file:
Data sets that are uploaded or created by running a tool appear in the history in the right Galaxy pane.
To give a history a new name, click the history's current name, type a new one and hit enter.
Clicking the name of a data set unfolds a preview, a short description and tools to manipulate the data.
Icons in the History
- Clicking the floppy (Download) icon will download the file to your computer
- To visualize a file in the middle pane, click the eye (View data) icon next to the name of the file.
Colors of files in the History
Data sets in the history have different colors representing different states.
- Grey: The job is placed in the waiting queue. You can check the status of queued jobs by refreshing the History pane.
- Yellow: The job is running.
- Green: The job has finished successfully (the status changes from yellow to green).
- Red: The job has finished but problems were encountered (the status changes from yellow to red).
Running Groomer in Galaxy
If you select a tool in Galaxy it will automatically detect all data sets in your history that it can use as input. In the case shown below the tool does not recognize the fastq file in the history:
If the tool does not recognize the fastq file, the file is so messy that the tool can't read it. Remember that there is a tool to clean messy fastq files: FASTQ Groomer.
Check the quality encoding in your fastq file (e.g. in FASTQC), and click the Execute button to start the tool:
Grooming takes a long time (30 min when Galaxy traffic is low). You can choose to wait, but if it takes too long you can click the Delete button in the History (see slides) to stop the tool. I have provided the groomed file: import it into Galaxy using this link.
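If you want to check the quality encoding yourself, you can look at the range of characters in the quality lines. The sketch below is a rough heuristic of my own, not what FASTQC or the Groomer do internally: the function name and the cut-off values are assumptions based on the usual character ranges of the two encodings (offset 33 starts at '!', offset 64 starts at '@').

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Returns None when the observed characters fit both encodings
    or when no quality characters were seen.
    """
    lowest = None
    highest = None
    for qual in quality_strings:
        for ch in qual:
            code = ord(ch)
            if lowest is None or code < lowest:
                lowest = code
            if highest is None or code > highest:
                highest = code
    if lowest is None:
        return None
    if lowest < 59:
        # Characters this low cannot occur with offset 64.
        return 33
    if lowest >= 64 and highest > 74:
        # Only high characters were seen, which points to offset 64.
        return 64
    return None  # ambiguous: inspect more reads


print(guess_phred_offset(["!!#5II"]))    # prints 33
print(guess_phred_offset(["hhhhdd^^"]))  # prints 64
```

When only a few high-quality reads are inspected, both encodings remain possible, which is one reason why automatic detection can be wrong.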
Using Trimmomatic in Galaxy
To clean your data use the Trimmomatic tool in the Quality Control section of tools. Click the name of the tool to display its parameters in the middle pane.
See this page for an overview of the Trimmomatic parameters.
A bit more explanation:
- The input file with the reads: Galaxy will automatically suggest a file from your History that has the right format, in this case: a fastq file. If Galaxy doesn't make a suggestion it means it cannot find any files in your History with the right format.
- The sequence of the adapter: provide a custom sequence. If you analyze your own data, you know which adapter sequences were used. Since this is public data, we don't know the name of the adapter. However, remember that FASTQC gives you a list of contaminating adapter sequences, so you do have the sequence of the adapter. Choose custom adapter sequence and paste the adapter sequence from FASTQC. You can only enter one sequence.
Click Execute to run the tool.
In the history you see a new item, colored yellow as long as the tool is running. Regularly hit the Refresh button in the History to check if the tool has finished. Clipping should be fast: after a few minutes you should have the result.
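To make clear what clipping a single adapter sequence means, here is a simplified Python sketch. It is not Trimmomatic's algorithm (Trimmomatic tolerates mismatches, for instance): it only looks for an exact copy of the adapter, or for the start of the adapter at the end of the read. The function name, the min_overlap parameter and the adapter sequence are assumptions made for this example.

```python
from typing import Tuple


def clip_adapter(seq: str, qual: str, adapter: str,
                 min_overlap: int = 5) -> Tuple[str, str]:
    """Remove the adapter and everything after it from a read.

    The quality string is cut at the same position so that
    sequence and qualities stay the same length.
    """
    pos = seq.find(adapter)
    if pos == -1:
        # No full copy found: check whether the read ends with the
        # first bases of the adapter (at least min_overlap of them).
        longest = min(len(adapter) - 1, len(seq))
        for length in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:length]):
                pos = len(seq) - length
                break
    if pos == -1:
        return seq, qual  # nothing to clip
    return seq[:pos], qual[:pos]


ADAPTER = "AGATCGGAAGAGC"  # placeholder: use the sequence from FASTQC

# Full adapter inside the read
print(clip_adapter("ACGTACGTAGATCGGAAGAGCTTT", "I" * 24, ADAPTER))
# Only the first 7 adapter bases at the end of the read
print(clip_adapter("ACGTACGTAGATCGG", "I" * 15, ADAPTER))
```

In both calls the read is cut back to its first 8 bases. A small min_overlap removes more adapter remnants but also clips more reads that match the adapter start by chance.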
Running FASTQC in Galaxy
Search for FASTQC in the tools pane and click the resulting FastQC link to open the parameter settings in the middle pane:
FASTQC automatically recognizes all files it can use as an input. Select the file you want to use.
The FASTQC implementation in Galaxy can take an optional file containing a list of contaminants. If you don't specify one, FASTQC will look for commonly used Illumina adapters.
In most cases you keep the default settings and click Execute.
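If you do want to supply your own list of contaminants, a small script can build it. The sketch below assumes the layout of the contaminant list that ships with FASTQC: one contaminant per line, a name and a sequence separated by a tab, and lines starting with # treated as comments. The function name, the contaminant name and the sequence are placeholders for this example; check the FASTQC documentation before relying on the format.

```python
from typing import Dict


def format_contaminants(contaminants: Dict[str, str]) -> str:
    """Build the text of a contaminant list: one 'name<TAB>sequence' per line."""
    lines = ["# name<TAB>sequence"]  # comment line, assumed to be ignored
    for name, seq in contaminants.items():
        lines.append(f"{name}\t{seq.upper()}")
    return "\n".join(lines) + "\n"


# Placeholder entry: use your own adapter names and sequences.
text = format_contaminants({"my_adapter": "agatcggaagagc"})
print(text, end="")

# To save it as a file that you can upload to Galaxy:
# with open("my_contaminants.txt", "w") as handle:
#     handle.write(text)
```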
Quality control in GenePattern
GenePattern is very similar to Galaxy: it is just as user-friendly and it also allows analysis of NGS data.
It provides easy access to hundreds of tools for different kinds of analyses (e.g. RNA-seq, microarray, proteomics and flow cytometry, sequence variation, copy number and network analysis) via a web browser.
- BITS GenePattern server
- fasta file containing Arabidopsis adapter sequence
- fasta file containing E. coli adapter sequence
- Overview of Trimmomatic parameters
You can work on the BITS GenePattern server. Ask the trainer for login details.
The GenePattern user interface
Logging in brings you to the GenePattern homepage:
- Click the GenePattern icon at the top of the page (red) to return to this home page at any time.
- The upper right corner shows your user name (green).
- The navigation tabs (blue) provide access to other pages.
We'll zoom in on the navigation tabs:
- The Modules tab gives access to the tools that you can run. Enter the first few characters of a module in the search box to locate a tool. Click the Browse modules button to list the tools.
- The Jobs tab shows an overview of the analyses that you have done by showing the tools that you have run, together with a list of output files that were generated.
- The Files tab shows a list of files you can use as input for the tools. These are files that you have uploaded from your hard drive or files that were generated as the output of a tool and that were saved to the Files tab. In your case the Files tab contains a folder uploads.
Creating a folder in GenePattern
To create a subfolder in the uploads folder on the Files tab:
- Click uploads. A window will appear where you can choose to create a subfolder, upload files or upload a folder.
- Click Create subdirectory
- Type a name for the subfolder
- Click Create
Searching a tool in GenePattern
You can find a module by typing its name into the search box on the Modules tab:
Searching for a tool makes its name appear in the main window.
Running tools in GenePattern
Clicking the name of the tool will open its parameter form in the main window.
Fill in the parameters. To obtain a description of the parameters of a tool and their default values click the Documentation link at the top of the page.
Many input files are located in the SHARED_DATA folder in the subfolder BITS_trainingdata. To load such an input file:
- click the Add path or URL button
- expand SHARED_DATA
- expand BITS_trainingdata
- select the file
- click the Select button
If you want to use a file in the Files tab as an input, simply drag it from the Files tab and drop it in the area of the parameter form that is labeled Drag files here.
Click Run to start the analysis.
As long as the tool is running you see an arched arrow in the top right corner:
When the tool has finished the arched arrow is replaced by a checkmark and the file(s) containing the results appear at the bottom:
Note that apart from the file containing the results, other files are generated, e.g. stdout.txt, which contains the log output of the tool. You can consult this log in case of problems.
Store the output of a tool in GenePattern
Copy the file to the uploads folder on the Files tab to store it permanently and to make it available as input for other tools. Output files that are not saved in the uploads folder are stored for 7 days on the server and are visible via the Jobs tab.
When a tool has finished, the output files are listed at the bottom of the page.
- Click the name of the output file.
- Select Copy to Files Tab
Running Groomer in GenePattern
The Broad GenePattern server does not contain the Groomer tool, but we have added the tool to our BITS GenePattern server.
- Search for the Groomer tool in GenePattern.
- Define the parameters. One of the parameters you need to define is Input format: the quality encoding of the fastq file you want to clean. The encoding is important because it determines the offset of the quality scores (ASCII offset 33 or ASCII offset 64). If you're not sure, you can check the encoding of your file in the FastQC report (bear in mind that FastQC occasionally guesses wrong).
- Run the Groomer tool.
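The role of the offset can be illustrated with a few lines of Python. The sketch below converts a quality string from ASCII offset 64 to ASCII offset 33. It only shows the arithmetic and is not the Groomer's actual implementation, which also validates the rest of the file; the function name is made up for this example.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a quality string from ASCII offset 64 to ASCII offset 33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode the quality score
        if score < 0:
            # A character below '@' cannot occur with offset 64.
            raise ValueError(f"{ch!r} is not a valid offset-64 quality character")
        converted.append(chr(score + 33))  # encode with the new offset
    return "".join(converted)


print(phred64_to_phred33("hh@"))  # prints II!
```

In the example, 'h' (ASCII 104) stands for quality score 40, which is written as 'I' (ASCII 73) with offset 33. Choosing the wrong input format therefore shifts every quality score by 31.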
Running FastQC in GenePattern
- Search for the FASTQC tool
- Fill in the parameters
- Run the FASTQC tool
You can open the resulting HTML report in your browser:
- Click the name of the output file at the bottom of the page
- Select Open Link
Running Trimmomatic in GenePattern
In GenePattern you can improve the quality of your NGS data using the Trimmomatic tool.
- Search for the Trimmomatic tool
- Fill in the parameters: See this page for an overview of the Trimmomatic parameters.
- Run Trimmomatic
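One of the cleaning steps mentioned at the top of this page is trimming low-quality bases at the end of the reads. The Python sketch below shows the idea, similar in spirit to Trimmomatic's TRAILING step, but it is my own simplified illustration and not Trimmomatic's code. The function name, the default threshold of 20 and the assumption of ASCII offset 33 are choices made for this example.

```python
from typing import Tuple


def trim_trailing(seq: str, qual: str, threshold: int = 20,
                  offset: int = 33) -> Tuple[str, str]:
    """Cut bases off the end of a read while their quality is below threshold."""
    end = len(seq)
    # Walk back from the last base until a base of sufficient quality is found.
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes quality 40, '#' quality 2 and '!' quality 0 (offset 33).
print(trim_trailing("ACGTAC", "IIII#!"))  # prints ('ACGT', 'IIII')
```

A read whose bases are all below the threshold comes back empty, so after trimming you would normally also discard reads that have become too short.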
Removing adapters using command line tools