Mapping of NGS data

From BITS wiki
Jump to: navigation, search

[ Main_Page | NGS data analysis | Improving the quality of NGS data | Linux command line ]

After quality control, the next step is to align the reads to a reference sequence.

The reference is in most cases the full genome sequence but sometimes, a library of EST sequences is used. In either way, aligning your reads to a reference sequence is called mapping.

The most used mappers are BWA[1] and Bowtie[2] for DNASeq data and Tophat[3], STAR[4], or HISAT2 for RNASeq data.

Mappers differ in methodology, parameters, how fast and how accurate they are and whether they tolerate spliced alignments or not (relevant for RNA-Seq). Bowtie is faster than BWA, but looses some sensitivity (does not map an equal amount of reads to the correct position in the genome as BWA). BWA and Bowtie cannot align spliced reads while Tophat, STAR and HISAT2 can.

At the moment STAR is the most popular RNASeq mapper and HISAT2 is being pushed over TopHat.

Mapping

Links:

1. Run STAR with default parameters (single pass) on the trimmed and groomed SRR074262 file.
Don't forget to use the built-in indexed Arabidopsis thaliana TAIR10 genome. View the log file to get an overview of the mapping results.

2. Run STAR with default parameters on the trimmed and groomed SRR074262 file but set maximum intron length to 3000 (it's Arabidopsis data). When you map RNASeq reads to the genome the alignments will have gaps (introns). This parameter specifies the maximum size of these gaps.
You can find a detailed description of the STAR parameters on this page

3. Run STAR with default parameters on the trimmed and groomed SRR074262 file but set maximum intron length to 3000 and maximum number of mismatches to 2. This is the number of mismatches you allow in the alignment. Typically you set it to 5% of the read length but the choice depends on the quality of the reads.

4. Run STAR with default parameters on the trimmed and groomed SRR074262 file but set maximum intron length to 3000 and maximum number of mismatches to 2. Use two pass mode: two-pass mode gives a much more accurate mapping than single-pass mode.


Exercise 2: Mapping DNASeq reads

Exercise created by Morgane Thomas Chollier

Links:

Map the reads of the groomed and trimmed control sample with Bowtie to the E.coli genome.

Obtaining the reference genome

In the ChIP-Seq experiment of E. coli we want to see which genomic regions are bound to transcription factor FNR by comparing the reads from the ChIP sample to those of the control sample. However, at this point what we have is a set of reads that are identified by their location on the flow cell. To answer our question we should link the reads to regions in the genome to obtain their genomic coordinates. This process is called mapping.

For Illumina reads the standard mappers are BWA and Bowtie (version 1 and 2).

Check the manual on the Bowtie website.

Bowtie needs the complete genome, in FASTA format as a reference sequence to align the reads to.

The genome sequence of E. coli K-12 MG1655 needs to be in a specific format (=index) for bowtie. Several pre-built indexes are available to download on the bowtie webpages or the iGenomes website.

Although the E. coli sequence is available we will not use it to show you how you should proceed if you don't find your reference sequence on this website. In that case you will need to make the index yourself.

If you can't find your reference on the iGenomes website you have to download it from:

Since Ensembl focuses on higher eukaryotes, we are going to download the genome from NCBI.

NCBI Nucleotide is notorious for the amount of errors it contains both in sequences and in annotations. Therefore, if available you should always use sequences from RefSeq, the clean subset of NCBI’s Nucleotide database. This sequence is not a RefSeq sequence. You can see that because the accession number does not contain an underscore and all RefSeq accession numbers contain an underscore.

This brings us to a RefSeq record with accession number NC_000913.

This creates a file called sequence.fasta in the Downloads folder of your computer. Upload the downloaded file to Galaxy.


Mapping the reads

Open the Bowtie_1 aligner parameter form. Use the E.coli genome for mapping. Bowtie needs an input file containing the reads (in our case SRR576938.fastq). Bowtie can map single end reads like we have but also paired end reads. In the case of paired end reads you have two FASTQ files, one with the upstream reads and one with the downstream reads.

Bowtie has two modes of mapping. The simplest strategy is called v-mode alignment: you align complete reads (from the first to the last base aka end-to-end) to the reference and you count the number of mismatches in this alignment. In this mode quality values are ignored and you need to tell bowtie the maximum number of mismatches you allow.

1. Do a v-mode mapping allowing 2 mismatches in the alignments.

Remember because the base quality at the 3'end of the reads is lower, base calls at the 3'ends are often incorrect. This will inevitably lead to mismatches in the alignments. Reads with more than 2 mismatches will not be reported. To avoid losing too many reads during the mapping we can either trim low quality bases from the 3' ends of the reads before the alignment is done or use a mapping strategy that takes into account the quality scores of the bases.

This strategy is called n-mode alignment. It's the default mode. It aligns seeds, the first N bases of the reads at the high quality 5'end, to the reference. You have to set the length of the reads and the maximum number of mismatches allowed in the seed alignment. Additionally the sum of the quality scores at all mismatched positions (not just in the seed) is calculated and you can set a maximum for this parameter. In this way, reads with mismatches with high quality scores will not be reported whereas mismatches with low scores are more or less ignored.

The FASTQC report showed that the last base is of low quality. Since the reads are 36 bases ling we could use seeds of 35 bases for the mapping. Do an n-mode mapping with seeds of 35 bases allowing 2 mismatches in the seeds.

We also need to specify that we only want to report reads that map specifically to one location in the reference. By default, bowtie will include unmapped reads in the output file. That's unnecessary since no one uses these unmapped reads.

We want to get a rough idea of the quality of the mapping. Look at the stdout.txt file that was generated by bowtie to get the basic statistics of the mapping.

The output of Bowtie is a SAM file. The SAM format corresponds to large text files, that can be compressed ("zipped") into .bam files that take up to 4 times less disk space and are usually sorted and indexed for fast access to the data they contain. The index of a .bam file is named .bai and some tools require these index files to process the .bam files. So we need to transform the .sam file with our mapping results to a sorted .bam and .bai file. You can use one of the tools from the Picard toolbox for this.


Exercise 3: Sorting and indexing BAM/SAM files

The BAM files need to be sorted according to genomic location and indexed because some downstream tools cannot handle raw BAM files since they are so large and chaotic. At this point they are not, the reads are in the same order as they were in the FASTQ file, according to position on the flow cell.

Sort (and index) the mapping files so that they can be used as input for IGV and Qualimap. Sorting will add an index to the BAM file (this is the .bai file that is generated). Download the sorted BAM and BAI files to your computer.

Qualimap BAMQC analysis

Search for the Qualimap BamQC tool. Input is a BAM file in which the reads are sorted according to genomic location.

Exercise 4: Quality control of DNASeq mapping results

Note that Qualimap expects a coordinate sorted BAM as input.
Perform a BamQC analysis on the sorted BAM file.</p>


Look at the Coverage section of the report.

Look at the Mismatches and indels section of the report.

Look at the coverage across reference plot. The upper figure provides the coverage distribution (red line) and coverage deviation across the reference sequence. The coverage is measured in X. The lower figure shows GC content across reference (black line) together with its average value (red dotted line).

Look at the coverage histogram.



Exercise 5: Quality control of RNASeq mapping results

Perform a BamQC analysis on the BAM file.

Look at the Coverage section of the report.

Look at the ACGT Content section of the report. The GC percentage is around 49%. That's strange because the average GC content of the Arabidopsis thaliana genome has been reported as 36%.

Look at the Chromosome stats section.

Look at the genome fraction coverage plot.



Qualimap RNASEQC analysis

Search for the Qualimap RNASEQC tool. Input is a BAM file. The tool needs annotation in the form of a .gtf file.

Exercise 6: Qualimap quality control of RNASeq mapping results

Do RNASeq QC analysis on the SRR074262 bam file.

The sum of Intronic and Intergenic is equal to No feature assigned.

Note on RNASeQC analysis on human samples: Qualimap will generate a warning when you analyze human RNASeq data. The human genome sequence contains a lot of contigs with sequences that cannot be assembled. Not all of these sequences have been annotated so their ID does not appear in the human gtf file. As a result Qualimap throws a warning saying that you gave him sequences that do not appear in the gtf file. The warning does not hamper the analysis, Qualimap will still generate a report.


Exercise 7: RSeQC quality control of RNASeq mapping results

Run the following tools on the SRR074262 bam file that was generated in the fourth run of STAR.

1. bam_stat

2. read_distribution

3. junction saturation

Mapping in Galaxy

Links:


Mapping RNASeq reads in Galaxy

STAR has a large number of parameters, we'll give an overview of the most important ones:

  • Single end or paired end data: the parameters you have to set will adjust accordingly
  • RNASeq Fastq file: STAR automatically detects files it can use as input, select the file you want to map.
  • Custom or built-in reference genome: many reference genomes are built-in in Galaxy just select the correct organism from the list of reference genomes.
  • Length of the genomic sequence around annotated junctions: the default is 100 but the ideal value is read length-1.
  • Count number of reads per gene: map reads and create a count table (table with counts of how many reads map to each gene).
  • Would you like to set output parameters (formatting and filtering)?: in most cases yes because the default settings will most likely not be ideal for your data
  • Would you like to set additional output parameters (formatting and filtering)?: in most cases yes because the default settings will most likely not be ideal for your data
  • Would you like unmapped reads included in the SAM?: by default STAR does not save the unmapped reads, so if you want to analyze them (BLAST...) you need to change this setting.
  • Maximum number of alignments to output a read's alignment results, plus 1: default is 10 meaning that reads that map to more than 10 locations in the genome are excluded from the results. Multimappers are common when you map short reads. What to do with them is a complicated issue. You could use them to represent expression of whole classes/families of RNAs (e.g. transposons, gene families...). It can be useful to have two separate files: one for unique mappers and one for multimappers.
  • Maximum number of mismatches to output an alignment, plus 1: maximum number of mismatches for a read (single-end) or a pair of reads (paired-end). Default is 10. The value you should choose is dependent on the read length. For short quality trimmed reads you typically allow 5% mismatches.
  • Maximum ratio of mismatches to read length: how many mismatches you allow in the alignment (number is represented as a fraction of the total read length). Typically you choose 0.05 (= 5%) but this depends on the quality of the reads. In case of reads with many sequencing errors you need to increase the fraction of mismatches you allow.
  • Other parameters (seed, alignment, limits and chimeric alignment): choose extended parameter list because the default settings will most likely not be ideal for your data
  • Alignment parameters: Maximum intron size: maximum distance between reads from a pair when mapped to the genome.
  • Two-pass mode: Use two pass mode to better map reads to unknown splice junctions: for the most accurate mapping, you should run STAR in 2-pass mode. It allows to detect more reads mapping to novel splice junctions. The basic idea is to run STAR with standard parameters, then collect the junctions detected in this first pass, and use them as annotated junctions for the second pass mapping.
  • Parameters related to chimeric reads: chimeric reads occur when one read aligns to two distinct portions of the genome. In RNA-Seq chimeric reads may indicate the presence of chimeric genes. Many chimeric genes form through errors in DNA replication or DNA repair so that pieces of two different genes are combined. Chimeric genes can also occur when a retrotransposon accidentally copies the transcript of a gene and inserts it into the genome in a new location. Depending on where the new retrogene appears, it can produce a chimeric gene...

Click Execute to start the mapping.

STAR produces 3 result files:

  • bam file containing all alignments (multimappers, reads that map to multiple locations, are printed at each location)
  • tab file containing all detected splice junctions
  • log file containing mapping statistics


Mapping DNASeq reads in Galaxy

This is an overview of the main parameters:

  • Will you select a reference genome from your history or use a built-in index? Galaxy has many built-in genomes for Bowtie 1 but you can also use a fasta file from the history when the organism you work is not supported.
  • Is this library mate-paired? single end or paired end ?
  • FASTQ file Galaxy will automatically detect potential input files, select the file you want to use as input.
  • Bowtie settings to use ask for full parameter list since the defaults are most likely not ideal for your data
  • Trim n bases from high-quality (left) end of each read before alignment (-5) trim bases from high-quality (left) end of each read before alignment, default is 0.
  • Trim n bases from low-quality (right) end of each read before alignment (-3) trim bases from low-quality (right) end of each read before alignment, default is 0.
  • Alignment mode when the default -n option is used, bowtie determines which alignments are valid according to the following policy: alignments may have no more than n mismatches (where n is a number 0-3, set with Maximum number of mismatches permitted in the seed (-n)) in the first l bases (where l is a number 5 or greater, set with Seed length (-l)) on the high-quality (left) end of the read. The first l bases are called the "seed". The sum of the Phred quality scores at all mismatched positions (not just in the seed) may not exceed e (set with Maximum permitted total of quality values at all mismatched read positions (-e)).
    In -v mode, alignments may have no more than v mismatches, where v may be a number from 0 through 3 set using the Maximum number of mismatches (-v) option. Quality values are ignored.
  • Suppress all alignments for a read if more than n reportable alignments exist (-m) default is no limit. Bowtie is designed to be very fast for small -m but can become significantly slower for larger values of -m


Download mapping results from Galaxy

Click the name of the file containing the sorted alignments in the history.
Click the download button at the bottom of the description. You should download two files: the bam file containing the mapping results and an index file (.bai) for fast access to the bam file. In Galaxy, indexing of bam files is done automatically. You need to download both files into the same folder.


IGV2.png



Mapping via command line tools

On our Linux command line page you can find:
an exercise on mapping with Bowtie via the command line.

We will handle the mapping in detail in advanced NGS trainings, so we are not going into more detail now.


Visualisation of mapping results in IGV

IGV needs a sorted BAM file and an index (.bai) file.

  • Open IGV by clicking its icon on the Desktop. Be patient, it might take a few minutes for the program to start.
  • If necessary change the genome in IGV from Human hg19 to the one you used in the mapping.
    IGV3.png
  • Load the mapped reads via File in the top menu and Load from File.
    IGV4.png

    Select the .bam file to open. You don't need to load the .bai file, it's suffcient that it is present in the same folder as the .bam file.
  • This loads the data into the center view. At this point, you can't see the reads, you have to zoom in to view them.
  • To zoom in on a gene type its accession number in the top toolbar and clicking Go:
    IGV5.png
  • Zooming in can be done using the zoom bar in the top toolbar:
    IGV6.png

The reads are represented by grey arrows, the arrow indicating the orietation of the mapping. Hovering your mouse over a read gives additional info on the mapping. The colored nucleotides indicate mismatches between the read and the reference.

By default IGV calculates and displays the coverage track (red) for an alignment file. When IGV is zoomed to the alignment read visibility threshold (by default 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). You can view count details by hovering the mouse over a coverage bar:

IGV8.png

Exercise 8: Viewing the aligned DNA reads in IGV

Open IGV. Be patient, it might take a few minutes for the program to start.

Change the genome in IGV from Human hg19 to the one you used in the mapping.

You can also visualize the annotation (genes) in IGV. You can obtain a file with annotations from the Refseq record.

You can also download the GFF3 file from our website.

If you want to load the .gff3 file and visualize the annotation properly in IGV, it’s necessary to comment (or remove) the third line:

##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
## NC_000913.3	RefSeq	region	1	4641652	.	+	.	ID=NC_000913.3:1..4641652;Dbxref=taxon:511145;Is_cir...
NC_000913.3	RefSeq	gene	190	255	.	+	.	ID=gene-b0001;Dbxref=ASAP:ABE-0000006,ECOCYC:EG11277...

You can visualize reads in IGV as long as they are sorted according to genomic location. Download the two sorted and indexed bam files (for SRR576933 and SRR576938) from GenePattern to your computer and load them in IGV.

Zoom in on region NC_000913.3:821,334-821,380 and look at position 821,357 and scroll a bit down in the control sample. Hover your mouse over the read with a colored G at this position and a colored A a few bases downstream.


IGV17.png

Hover your mouse over the Coverage track in this position. When IGV is zoomed to the alignment read visibility threshold (by default, 30 KB), the coverage track displays the depth of the reads displayed at each position as a gray bar chart.


IGV27.png

By default, if a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). In this case 21% of the reads have a G instead of an A in this position. Although 21% is over 20% the bar is not colored because the quality scores are taken into account and the quality scores of the G’s are rather low. However, when you override this default threshold by Setting the allele frequency threshold (right click the coverage bar) to a lower value, the bar becomes colored.


Exercise 9: Viewing the aligned RNA reads in IGV

  • Select the correct genome from the genome drop-down list on the upper-left of the IGV window.
  • Drag and drop the annotations (Gene track) to the top of the tracks section.
  • Zoom in until you see the Sequence track appearing at the bottom.
  • Drag and drop it beneath the Gene track.

  • IGVRNA1.png

  • Load the coordinate sorted SRR074262 bam file. Note that you need the accompanying bai file in the same folder.
  • Zoom in on AT1G02930
  • Zoom in until you see the nucleotides of the reference
  • Hover your mouse over a grey read, extra info becomes visible:
    MAPQ defines the mapping quality, this is either a very high number (255) meaning the read mapped uniquely in this position or a low number (0, 1, 2, 3) meaning the read is a multimapper (aligns to multiple locations). Which set of numbers is used to define multimappers depends on the mapper.
    Reference span specifies the genomic region the read is mapped to.
    The Cigar string will show deletions or insertions (reference has more or less nucleotides on this position than read) while the colors of the bases in the read show mismatches in the alignment (nucleotide in read differs from nucleotide in reference).
    Clipping specifies if any soft clipping (mapper ignores mismatches in the alignment at the ends of the reads) took place.
  • Hover your mouse over the coverage track and try to interpret the numbers that pop up. When IGV is zoomed to the alignment read visibility threshold (by default, 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T).
  • There’s a lot of reads for AT1G02930, right click the bam track and squish the reads. In squished mode the gaps in the alignments are visible as blue lines (while the reads are grey). You can clearly see the blue zones that correspond to the introns (thin blue lines) in the annotation track.

  • IGVRNA2.png

  • Open the sorted SRR074285 bam file
  • Hover your mouse over the highest peak for AT1G02930 in the Coverage track

You can solve this by creating a Sashimi plot. It is more informative with respect to different expression and splicing.




References:
  1. http://bio-bwa.sourceforge.net/
  2. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  3. http://tophat.cbcb.umd.edu/
  4. http://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635