Mapping of NGS data in Galaxy
In this tutorial we will demonstrate the use of mappers to align Illumina data to a reference genome in Galaxy. In the BITS Galaxy, currently only the default mappers (as available on the Main Galaxy) are installed. See this wiki pages for a full list of mappers.
- 1 Intro
- 2 Know your data before you start mapping
- 3 Using Bowtie
- 4 SAM is the default format to report NGS alignments
- 5 Filtering on the FLAG field to remove unmapped reads
- 6 SAM can be compressed to BAM to save storage space
- 7 MPileup to summarize the alignment per position in the genome
- 8 Optional task
After insights gained by the quality analysis, the next step is to align DNA-seq reads to a reference genome. An alternative which we won't consider in this tutorial, is to assemble the reads in order to attempt to create a 'de novo' genomic sequence.
The most used mappers of DNA-seq data are BWA and Bowtie. Mappers differ in which options they can take in, how fast they are, and how accurate they are. Bowtie is faster than BWA, but looses some sensitivity (does not map an equal amount of reads to the correct position).
Know your data before you start mapping
Some crucial parameters the mapper has to know before mapping: whether your dataset is single- or paired-end. If paired-end, how far apart the reads will on average be on the genome. Also the orientation of the reads might differ! Forward and reverse reads are mostly in separate datasets: but you have to check the details of the sequencer to know the orientation of both sets (e.g. forward - forward, or forward - reverse, etc.). This information is needed by the mapper to know whether to map to the reference sequence or its reverse-complement. Check for example the ERR032031 dataset on the Short Read Archive at NCBI.
We are going to map the sample reads of the ERR032031 dataset to the human genome build 37 (aka hg19).
Galaxy shows by default a basic set of parameters which you need to fill in. The most important are the reference genome (choose Human genome bld 37), type of reads (paired end reads), the maximum expected insert size (leave the default).
You can opt to fine tune the mapping by clicking Full parameter list. I encourage you with your own data to go over these settings and change them if it suits your case. Interesting setting is to Report all valid alignments to try to report all possible alignments. These settings depends on your goal of your analysis, e.g. when looking for SNPs, you might want to disable 'Report all valid alignments', and instead have high quality values at mismatch positions ('-e' option).
For now, just leave the defaults. And execute!
SAM is the default format to report NGS alignments
When mapping is done, the one dataset generated is in SAM format. This is a kind of tab-delimited text format. View the details of this dataset by clicking on the dataset name in the history.
An important concept in Galaxy are datatypes: this dataset is in SAM format. If Galaxy knows the type, it knows by default what data is in the different columns. This counts also for other data types in Galaxy, such as BED, interval, ... You see the column names are displayed automatically in the data preview in the history and in the middle pane at the top.
SAM format contains alignment information per read
The SAM format has to fulfill strict criteria. The header part (lines starting with @) contains information about which reference sequences have been used, and the complete command line executed (after @PG).
After the header comes the reads section.
The line starts with the read name. The next column with numbers is the FLAG field: this is one number representing a combination of different states in which the aligned read is in. E.g. if there is a '4' here, it means that the read is unmapped. To decode these flags you can use this small tool to decode FLAG fields. Another important field in SAM is the MAPQ column: this number represents the mapping quality score: how certain is the mapper that this reads maps here. Unfortunately, this version of Bowtie only reports 255 (mapped) and 0 (unmapped) MAPQ values. The CIGAR field needs also some explanation: this field summarizes how the read is mapped to the reference. See the CIGAR wiki page. The column ISIZE contains the inferred insert size between the forward and reverse read. Next, you find also the sequence of the reads, and a column with optional additional information about the reads state.
For a full explanation of the SAM format, see this wiki page on SAM.
Filtering on the FLAG field to remove unmapped reads
Usually unmapped reads are not interesting to reach your goal. To remove them, we can filter SAM on the FLAG field.
|Check the first two values in the FLAG field with the this tool|
Summary: read paired read mapped in proper pair mate reverse strand first in pair
Summary: read paired read mapped in proper pair read reverse strand second in pair
To filter, find the Filter SAM on bitwise flag values tool in Galaxy.
Click on Add new Flag. A new section appears.
Since we want to keep the mapped reads, we need reads in which the bit 0x0004 is NOT set (because if it's set, the read is not mapped. Hence the settings in the filter sam tool are:
You can set more flag filters if you want. For now, click execute to run the tool. Rename the resulting data set to 'Bowtie mapped reads ERR032031'
SAM can be compressed to BAM to save storage space
BAM, the binary alignment format is a compressed form of a SAM file. It can in most cases used instead of SAM to subsequent tools. So it's a good practice to convert SAM to BAM. Note: the contents of a BAM file cannot be viewed, since it is binary format (and not text anymore).
|Task: convert the SAM output of Bowtie to BAM using 'SAM-to-BAM converts SAM format to BAM format' tool.|
Rename the new data set to 'Bowtie mapped reads ERR032031 BAM'
MPileup to summarize the alignment per position in the genome
To get a report of the alignment summarized per position in the genome, you can use the MPileup tool of samtools package.
As input, choose the BAM file of the alignment. Make also sure you use the hg19 genome build.
The mpileup tool provides two output datasets: one text file containing logs of the tool and the more interesting pileup formatted data set. The pileup format provides for each genomic position a summary of how many reads are mapped on that position, and whether they differ from the reference. See this page for more info about the pileup format.
Filtering the pileup
We can filter the pileup to exclude genomic positions with low coverage (meaning those positions to which not many reads map). For this we use Filter pileup on coverage and SNPs tool in Galaxy. We will filter on positions with at least 3 times coverage, and only where a difference with the reference is reported. In addition, we will also display the mapping quality. Note: the term 'quality' in this tool refers to the mapping quality of the read to the genome (this has nothing to do anymore with the base calling quality of the fastq file).
Further, we will also convert the simple pileup format, which contains one genomic coordinate per line, to an interval format by setting this parameter. In summary, the parameter settings look like:
Run this tool!
Important: the output is now a data set in interval format. Every region of the genome is now indicated with two coordinates. So our single position in indicated with two coordinates, the one on which the feature is located, and the position right behind it. This is what is called a format with a 'closed start' and 'half-open end' format. Why? Such a format eases computation with genomic coordinates. Interval format is therefore the most native format of Galaxy. You might check this page on Galaxy datatypes] to understand the interval format a bit better (and its related format, the bed format).
There are some pretty advances to the interval format: we now visualize the detected SNPs in a genome browser. Later on, we can do computation with genomic intervals.
Visualize the detected SNPs
Galaxy comes with a visualization tool, called Trackster. This is a separate part of Galaxy, next to the 'Analysis part' we have been using up to now. To visualize datasets, click on the icon in the filtered pileup dataset.
Click on 'New visualisation'
Give the new visualisation a name like 'SNPs from ERR032031'. Make sure it is visualized against hg19 (what we've been using to map).
If Trackster loads this dataset for the first time, it can take a while. When loading is complete, you have to choose which chromosome you want to visualize: now choose 'chr21.
In trackster, you can save the state of your visualisation by click in the top-right corner on . You can navigate in trackster by dragging with the mouse on the genome ruler, and by clicking -/+ in the top bar. An interesting feature is that you can bookmark a location, to come back to it later.
We note that most SNPs are being detected toward the end in Chromosome 21 of hg19.
For now, we stop or short first encounter with Trackster. Later we will add more tracks (= annotated genomic positions) to our visualisation. Let's now save our visualisation. Next click on 'Analyze data' in the top bar to go back to the analysis section.
Repeat the same steps with the trimmed DRR00542 dataset, using the aligner BWA against hg19. Note that the SAM output of BWA provides detailed mapping quality values (in contrast to Bowtie). BWA is a little bit slower, but more sensitive then Bowtie.