Mapping of NGS data

[ Main_Page | NGS data analysis | Improving the quality of NGS data | Linux command line ]

After quality control, the next step is to align the reads to a reference sequence.

The reference is in most cases the full genome sequence but sometimes, a library of EST sequences is used. In either way, aligning your reads to a reference sequence is called mapping.

The most used mappers are BWA^[1] and Bowtie^[2] for DNASeq data and Tophat^[3], STAR^[4], or HISAT2 for RNASeq data.

Mappers differ in methodology, parameters, how fast and how accurate they are and whether they tolerate spliced alignments or not (relevant for RNA-Seq). Bowtie is faster than BWA, but looses some sensitivity (does not map an equal amount of reads to the correct position in the genome as BWA). BWA and Bowtie cannot align spliced reads while Tophat, STAR and HISAT2 can.

At the moment STAR is the most popular RNASeq mapper and HISAT2 is being pushed over TopHat.

Mapping

Links:

paper on intron sizes in various organisms

1. Run STAR with default parameters (single pass) on the trimmed and groomed SRR074262 file.
Don't forget to use the built-in indexed Arabidopsis thaliana TAIR10 genome. View the log file to get an overview of the mapping results.

2. Run STAR with default parameters on the trimmed and groomed SRR074262 file but set maximum intron length to 3000 (it's Arabidopsis data). When you map RNASeq reads to the genome the alignments will have gaps (introns). This parameter specifies the maximum size of these gaps.
You can find a detailed description of the STAR parameters on this page

Why less multimappers than in the first run?
By decreasing the intron size (and thus the allowed gap size) the number of junctions that reads can map to is decreased. In other words you decrease the reference sequence space the reads can map to: less possible mapping locations means less multimappers.

3. Run STAR with default parameters on the trimmed and groomed SRR074262 file but set maximum intron length to 3000 and maximum number of mismatches to 2. This is the number of mismatches you allow in the alignment. Typically you set it to 5% of the read length but the choice depends on the quality of the reads.

Why more unmapped reads than in the second run?
You will discard more alignments than in the second run, more specifically the alignments with more than 2 mismatches.

4. Run STAR with default parameters on the trimmed and groomed SRR074262 file but set maximum intron length to 3000 and maximum number of mismatches to 2. Use two pass mode: two-pass mode gives a much more accurate mapping than single-pass mode.

Exercise 2: Mapping DNASeq reads

Exercise created by Morgane Thomas Chollier

Links:

Map the reads of the groomed and trimmed control sample with Bowtie to the E.coli genome.

Obtaining the reference genome

In the ChIP-Seq experiment of E. coli we want to see which genomic regions are bound to transcription factor FNR by comparing the reads from the ChIP sample to those of the control sample. However, at this point what we have is a set of reads that are identified by their location on the flow cell. To answer our question we should link the reads to regions in the genome to obtain their genomic coordinates. This process is called mapping.

For Illumina reads the standard mappers are BWA and Bowtie (version 1 and 2).

Which version of Bowtie are we going to use?
We will use Bowtie version 1 as this version was designed for mapping short reads (< 50nt) and our reads are short (36nt).

Check the manual on the Bowtie website.

Bowtie needs the complete genome, in FASTA format as a reference sequence to align the reads to.

Which E. coli strain was used in the experiment ?
Go to the paper and check the part Strains and growth conditions in the Materials and methods section. There you see that the experiment was done using E. coli K-12 MG1655.

The genome sequence of E. coli K-12 MG1655 needs to be in a specific format (=index) for bowtie. Several pre-built indexes are available to download on the bowtie webpages or the iGenomes website.

Although the E. coli sequence is available we will not use it to show you how you should proceed if you don't find your reference sequence on this website. In that case you will need to make the index yourself.

If you can't find your reference on the iGenomes website you have to download it from:

Since Ensembl focuses on higher eukaryotes, we are going to download the genome from NCBI.

Which reference sequence was used in the experiment ?
Go to the paper and check the part High-throughput RNA sequencing (RNA-seq) analysis. There you see that the reads were mapped to an NCBI sequence with accession number U00096.

Search for this sequence on NCBI ?
Go to the NCBI website, select the Nucleotide database, type U00096 as a search term and click Search.

NCBI Nucleotide is notorious for the amount of errors it contains both in sequences and in annotations. Therefore, if available you should always use sequences from RefSeq, the clean subset of NCBI’s Nucleotide database. This sequence is not a RefSeq sequence. You can see that because the accession number does not contain an underscore and all RefSeq accession numbers contain an underscore.

Is there a RefSeq sequence available ?
In the Nucleotide record, scroll down to the Related information section in the right menu. There you see that a RefSeq sequence is available. Click the Identical RefSeq link.

This brings us to a RefSeq record with accession number NC_000913.

Download the sequence of the RefSeq record in FASTA format
In the record expand the Send to section (red). Select File as destination (green). This means that you download the data on your computer. Select FASTA format (blue). Click Create File.

This creates a file called sequence.fasta in the Downloads folder of your computer. Upload the downloaded file to Galaxy.

Mapping the reads

Open the Bowtie_1 aligner parameter form. Use the E.coli genome for mapping. Bowtie needs an input file containing the reads (in our case SRR576938.fastq). Bowtie can map single end reads like we have but also paired end reads. In the case of paired end reads you have two FASTQ files, one with the upstream reads and one with the downstream reads.

Bowtie has two modes of mapping. The simplest strategy is called v-mode alignment: you align complete reads (from the first to the last base aka end-to-end) to the reference and you count the number of mismatches in this alignment. In this mode quality values are ignored and you need to tell bowtie the maximum number of mismatches you allow.

1. Do a v-mode mapping allowing 2 mismatches in the alignments.

Remember because the base quality at the 3'end of the reads is lower, base calls at the 3'ends are often incorrect. This will inevitably lead to mismatches in the alignments. Reads with more than 2 mismatches will not be reported. To avoid losing too many reads during the mapping we can either trim low quality bases from the 3' ends of the reads before the alignment is done or use a mapping strategy that takes into account the quality scores of the bases.

This strategy is called n-mode alignment. It's the default mode. It aligns seeds, the first N bases of the reads at the high quality 5'end, to the reference. You have to set the length of the reads and the maximum number of mismatches allowed in the seed alignment. Additionally the sum of the quality scores at all mismatched positions (not just in the seed) is calculated and you can set a maximum for this parameter. In this way, reads with mismatches with high quality scores will not be reported whereas mismatches with low scores are more or less ignored.

The FASTQC report showed that the last base is of low quality. Since the reads are 36 bases ling we could use seeds of 35 bases for the mapping. Do an n-mode mapping with seeds of 35 bases allowing 2 mismatches in the seeds.

We also need to specify that we only want to report reads that map specifically to one location in the reference. By default, bowtie will include unmapped reads in the output file. That's unnecessary since no one uses these unmapped reads.

We want to get a rough idea of the quality of the mapping. Look at the stdout.txt file that was generated by bowtie to get the basic statistics of the mapping.

The output of Bowtie is a SAM file. The SAM format corresponds to large text files, that can be compressed ("zipped") into .bam files that take up to 4 times less disk space and are usually sorted and indexed for fast access to the data they contain. The index of a .bam file is named .bai and some tools require these index files to process the .bam files. So we need to transform the .sam file with our mapping results to a sorted .bam and .bai file. You can use one of the tools from the Picard toolbox for this.

Exercise 3: Sorting and indexing BAM/SAM files

The BAM files need to be sorted according to genomic location and indexed because some downstream tools cannot handle raw BAM files since they are so large and chaotic. At this point they are not, the reads are in the same order as they were in the FASTQ file, according to position on the flow cell.

Which Picard tool can we use to sort the files?
You can use Picard.SortSam for this.

Sort (and index) the mapping files so that they can be used as input for IGV and Qualimap. Sorting will add an index to the BAM file (this is the .bai file that is generated). Download the sorted BAM and BAI files to your computer.

Qualimap BAMQC analysis

Search for the Qualimap BamQC tool. Input is a BAM file in which the reads are sorted according to genomic location.

Exercise 4: Quality control of DNASeq mapping results

Note that Qualimap expects a coordinate sorted BAM as input.
Perform a BamQC analysis on the sorted BAM file.</p>

How many reads were clipped during the mapping? Why?
There are no clipped reads: that’s normal because Bowtie does not do any soft clipping.

Does the duplication rate in Qualimap correspond to the duplication rate in FASTQC? Why (not)?
This is different from what we saw in FASTQC but that’s because both tools have different definitions of duplicates. For FASTQC duplicates are reads with the same sequence, for Qualimap duplicates are reads for which the alignment starts at exactly the same position on the reference sequence.

Why is the duplication rate so high?
High duplication rates are linked to the fact that the E. coli genome is very small.

Look at the Coverage section of the report.

How to interpret this mean coverage?
This means that each base in the genome is represented on average 49 times in the reads: this is high but normal since it’s the control sample (genomic DNA cut into pieces so reads come from all over the genome) and the E. coli genome is small.

Look at the Mismatches and indels section of the report.

How is the error rate calculated?
It's total number of errors (mismatches + indels) / number of mapped bases. The total number of mismatches and the total number of indels are computed from the CIGAR values (optional fields in the bam file: NM – number of mismatches, MD - number of deletions). The error rate is not reported if these fields are missing in the bam file.

Look at the coverage across reference plot. The upper figure provides the coverage distribution (red line) and coverage deviation across the reference sequence. The coverage is measured in X. The lower figure shows GC content across reference (black line) together with its average value (red dotted line).

Why is there a link between GC-content and coverage?
Coverage of GC-rich regions tends to be lower because sequencing GC-rich sequences is extremely hard. These regions are hard to amplify because of bad denaturation.

Look at the coverage histogram.

Is the coverage uniform?
It’s a nice normal distribution so the coverage is indeed OK: you always have some variation.

What's on the X-axis of this plot?
On the X-axis there’s the sequencing depth. For each base in the reference the number of reads that cover it is calculated = sequencing depth or coverage. These depths are binned and displayed on the X-axis.

What's on the Y-axis of this plot?
The Y-axis shows the total number of bases in the reference that have a certain sequencing depth.

How's the quality of the mapping across the reference?
The Mapping quality across reference plot shows that mapping quality is consistently very high over the entire genome, almost all reads have a MAPQ = 255 (unique mappers).

Exercise 5: Quality control of RNASeq mapping results

Perform a BamQC analysis on the BAM file.

Look at the Coverage section of the report.

How to interpret this mean coverage?
This means that each base in the genome is represented on average 2 times in the reads. This might seem a very low number but the mean coverage depth is calculated against the full genome sequence (size of the reference is given in the bam file) so all intergenics and introns with zero coverage are included in these calculations. If you want to avoid this you should provide a GTF file with exon annotation to limit the analysis to the annotated exons. On the other hand, you will then lose all information on reads that map in intergenics and introns.

Look at the ACGT Content section of the report. The GC percentage is around 49%. That's strange because the average GC content of the Arabidopsis thaliana genome has been reported as 36%.

Why is the GC content so high?
The average GC content of the Arabidopsis thaliana genome is very low due to large intronic regions which contain large stretches of A-T bases. We mapped RNASeq data so our reads do not contain these intronic regions.

Look at the Chromosome stats section.

Why is the coverage on the plastid genome so high?
This was expected because plastid genes are densely packed and plastid genomes contain few intergenic regions, while mitochondrial genes are widely dispersed. Hence the low coverage in the mitochondrial (Mt) genome. Again coverage values are calculated based on the full length of the chromosomes, intergenics and introns included.

Look at the genome fraction coverage plot.

How much of the genome is covered at least once?
In total 13% of the genome is covered at least once, which roughly corresponds to estimates that around 10% of the genome is protein-coding. Coverage at intergenics and introns will be 0.

Qualimap RNASEQC analysis

Search for the Qualimap RNASEQC tool. Input is a BAM file. The tool needs annotation in the form of a .gtf file.

Exercise 6: Qualimap quality control of RNASeq mapping results

Do RNASeq QC analysis on the SRR074262 bam file.

Why is there a difference between the number of mapped reads and the number of alignments?
When calculating the total number of mapped reads, reads that map to multiple locations (= secondary alignments) are only counted once so if you add the number of mapped reads and the number of secondary alignments you get the total number of reads.

The sum of Intronic and Intergenic is equal to No feature assigned.

Note on RNASeQC analysis on human samples: Qualimap will generate a warning when you analyze human RNASeq data. The human genome sequence contains a lot of contigs with sequences that cannot be assembled. Not all of these sequences have been annotated so their ID does not appear in the human gtf file. As a result Qualimap throws a warning saying that you gave him sequences that do not appear in the gtf file. The warning does not hamper the analysis, Qualimap will still generate a report.

Exercise 7: RSeQC quality control of RNASeq mapping results

Run the following tools on the SRR074262 bam file that was generated in the fourth run of STAR.

1. bam_stat

2. read_distribution

3. junction saturation

Mapping in Galaxy

Links:

Mapping RNASeq reads in Galaxy

STAR has a large number of parameters, we'll give an overview of the most important ones:

Single end or paired end data: the parameters you have to set will adjust accordingly
RNASeq Fastq file: STAR automatically detects files it can use as input, select the file you want to map.
Custom or built-in reference genome: many reference genomes are built-in in Galaxy just select the correct organism from the list of reference genomes.
Length of the genomic sequence around annotated junctions: the default is 100 but the ideal value is read length-1.
Count number of reads per gene: map reads and create a count table (table with counts of how many reads map to each gene).
Would you like to set output parameters (formatting and filtering)?: in most cases yes because the default settings will most likely not be ideal for your data
Would you like to set additional output parameters (formatting and filtering)?: in most cases yes because the default settings will most likely not be ideal for your data
Would you like unmapped reads included in the SAM?: by default STAR does not save the unmapped reads, so if you want to analyze them (BLAST...) you need to change this setting.
Maximum number of alignments to output a read's alignment results, plus 1: default is 10 meaning that reads that map to more than 10 locations in the genome are excluded from the results. Multimappers are common when you map short reads. What to do with them is a complicated issue. You could use them to represent expression of whole classes/families of RNAs (e.g. transposons, gene families...). It can be useful to have two separate files: one for unique mappers and one for multimappers.
Maximum number of mismatches to output an alignment, plus 1: maximum number of mismatches for a read (single-end) or a pair of reads (paired-end). Default is 10. The value you should choose is dependent on the read length. For short quality trimmed reads you typically allow 5% mismatches.
Maximum ratio of mismatches to read length: how many mismatches you allow in the alignment (number is represented as a fraction of the total read length). Typically you choose 0.05 (= 5%) but this depends on the quality of the reads. In case of reads with many sequencing errors you need to increase the fraction of mismatches you allow.
Other parameters (seed, alignment, limits and chimeric alignment): choose extended parameter list because the default settings will most likely not be ideal for your data
Alignment parameters: Maximum intron size: maximum distance between reads from a pair when mapped to the genome.
Two-pass mode: Use two pass mode to better map reads to unknown splice junctions: for the most accurate mapping, you should run STAR in 2-pass mode. It allows to detect more reads mapping to novel splice junctions. The basic idea is to run STAR with standard parameters, then collect the junctions detected in this first pass, and use them as annotated junctions for the second pass mapping.
Parameters related to chimeric reads: chimeric reads occur when one read aligns to two distinct portions of the genome. In RNA-Seq chimeric reads may indicate the presence of chimeric genes. Many chimeric genes form through errors in DNA replication or DNA repair so that pieces of two different genes are combined. Chimeric genes can also occur when a retrotransposon accidentally copies the transcript of a gene and inserts it into the genome in a new location. Depending on where the new retrogene appears, it can produce a chimeric gene...

Click Execute to start the mapping.

STAR produces 3 result files:

bam file containing all alignments (multimappers, reads that map to multiple locations, are printed at each location)
tab file containing all detected splice junctions
log file containing mapping statistics

Mapping DNASeq reads in Galaxy

This is an overview of the main parameters:

Will you select a reference genome from your history or use a built-in index? Galaxy has many built-in genomes for Bowtie 1 but you can also use a fasta file from the history when the organism you work is not supported.
Is this library mate-paired? single end or paired end ?
FASTQ file Galaxy will automatically detect potential input files, select the file you want to use as input.
Bowtie settings to use ask for full parameter list since the defaults are most likely not ideal for your data
Trim n bases from high-quality (left) end of each read before alignment (-5) trim bases from high-quality (left) end of each read before alignment, default is 0.
Trim n bases from low-quality (right) end of each read before alignment (-3) trim bases from low-quality (right) end of each read before alignment, default is 0.
Alignment mode when the default -n option is used, bowtie determines which alignments are valid according to the following policy: alignments may have no more than n mismatches (where n is a number 0-3, set with Maximum number of mismatches permitted in the seed (-n)) in the first l bases (where l is a number 5 or greater, set with Seed length (-l)) on the high-quality (left) end of the read. The first l bases are called the "seed". The sum of the Phred quality scores at all mismatched positions (not just in the seed) may not exceed e (set with Maximum permitted total of quality values at all mismatched read positions (-e)).
In -v mode, alignments may have no more than v mismatches, where v may be a number from 0 through 3 set using the Maximum number of mismatches (-v) option. Quality values are ignored.
Suppress all alignments for a read if more than n reportable alignments exist (-m) default is no limit. Bowtie is designed to be very fast for small -m but can become significantly slower for larger values of -m

Download mapping results from Galaxy

Click the name of the file containing the sorted alignments in the history.
Click the download button at the bottom of the description. You should download two files: the bam file containing the mapping results and an index file (.bai) for fast access to the bam file. In Galaxy, indexing of bam files is done automatically. You need to download both files into the same folder.

Mapping via command line tools

On our Linux command line page you can find:
an exercise on mapping with Bowtie via the command line.

We will handle the mapping in detail in advanced NGS trainings, so we are not going into more detail now.

Visualisation of mapping results in IGV

IGV needs a sorted BAM file and an index (.bai) file.

Open IGV by clicking its icon on the Desktop. Be patient, it might take a few minutes for the program to start.
If necessary change the genome in IGV from Human hg19 to the one you used in the mapping.
Load the mapped reads via File in the top menu and Load from File.

Select the .bam file to open. You don't need to load the .bai file, it's suffcient that it is present in the same folder as the .bam file.
This loads the data into the center view. At this point, you can't see the reads, you have to zoom in to view them.
To zoom in on a gene type its accession number in the top toolbar and clicking Go:
Zooming in can be done using the zoom bar in the top toolbar:

The reads are represented by grey arrows, the arrow indicating the orietation of the mapping. Hovering your mouse over a read gives additional info on the mapping. The colored nucleotides indicate mismatches between the read and the reference.

By default IGV calculates and displays the coverage track (red) for an alignment file. When IGV is zoomed to the alignment read visibility threshold (by default 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). You can view count details by hovering the mouse over a coverage bar:

Exercise 8: Viewing the aligned DNA reads in IGV

Open IGV. Be patient, it might take a few minutes for the program to start.

Change the genome in IGV from Human hg19 to the one you used in the mapping.

Load the desired genomed.
Load the E. coli genome as reference (from the file Escherichia_coli_K_12_MG1655.fasta, downloaded to build the bowtie index). Top menu: Genome -> Load Genome from File The loaded genome appears in the top left panel:

You can also visualize the annotation (genes) in IGV. You can obtain a file with annotations from the Refseq record.

Download the annotations from RefSeq in GFF3 format.
Go to the RefSeq record of the E. coli genome. Expand the Send to section at the top of the page. Choose File as destination. Select GFF3 format.

You can also download the GFF3 file from our website.

If you want to load the .gff3 file and visualize the annotation properly in IGV, it’s necessary to comment (or remove) the third line:

##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
## NC_000913.3	RefSeq	region	1	4641652	.	+	.	ID=NC_000913.3:1..4641652;Dbxref=taxon:511145;Is_cir...
NC_000913.3	RefSeq	gene	190	255	.	+	.	ID=gene-b0001;Dbxref=ASAP:ABE-0000006,ECOCYC:EG11277...

You can visualize reads in IGV as long as they are sorted according to genomic location. Download the two sorted and indexed bam files (for SRR576933 and SRR576938) from GenePattern to your computer and load them in IGV.

Load the annotation and the bam files of the ChIP and the control sample.
Top menu: File -> Load from File: You should see the track now. Do the same for the .bam files. Note that you have to download the .bai files too and store them in the same folder as the .bam files. You do not have to explicitly open the .bai files in IGV but they have to be in the same folder as the .bam files or IGV will throw an error. Zoom in u8ntil you see the reads

Zoom in on region NC_000913.3:821,334-821,380 and look at position 821,357 and scroll a bit down in the control sample. Hover your mouse over the read with a colored G at this position and a colored A a few bases downstream.

What’s the mapping quality of this read?
The mapping quality is the number that follows MAPQ, in this case it's 255.

What does this quality score indicate?
It's the value Bowtie gives to uniquely mapped reads. It can also be a low number (0, 1, 2, 3) meaning the read is a multimapper: it also maps to other locations in the genome with an equal score or it maps to other locations but with lower scores. Which set of numbers is used to define multimappers depends on the mapper. Multimappers are colored white with light gray borders.

Which region is it mapped to?
Reference span specifies the genomic region the read is mapped to.

How many mismatches are in the alignment?
The Cigar string, in this case 36M, will show deletions or insertions (in this case no deletions or insertions) meaning that the reference has more or less nucleotides on this position than the read, while NM defines the number of mismatches (in this case 2). The colors of the bases in the read also show mismatches in the alignment meaning that the nucleotide in the read differs from the nucleotide on this position in the reference.

Hover your mouse over the Coverage track in this position. When IGV is zoomed to the alignment read visibility threshold (by default, 30 KB), the coverage track displays the depth of the reads displayed at each position as a gray bar chart.

What is the total number of reads in the control that map to this position?
Total count indicates that 38 reads map to this position.

How many reads in the control have a G in this position?
G: indicates that there are 8 reads with a G in this position.

By default, if a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). In this case 21% of the reads have a G instead of an A in this position. Although 21% is over 20% the bar is not colored because the quality scores are taken into account and the quality scores of the G’s are rather low. However, when you override this default threshold by Setting the allele frequency threshold (right click the coverage bar) to a lower value, the bar becomes colored.

What's the coverage of the ChIP read in this position?
Total count indicates it's 14.

Exercise 9: Viewing the aligned RNA reads in IGV

Select the correct genome from the genome drop-down list on the upper-left of the IGV window.
Drag and drop the annotations (Gene track) to the top of the tracks section.
Zoom in until you see the Sequence track appearing at the bottom.
Drag and drop it beneath the Gene track.

Load the coordinate sorted SRR074262 bam file. Note that you need the accompanying bai file in the same folder.
Zoom in on AT1G02930
Zoom in until you see the nucleotides of the reference
Hover your mouse over a grey read, extra info becomes visible:
MAPQ defines the mapping quality, this is either a very high number (255) meaning the read mapped uniquely in this position or a low number (0, 1, 2, 3) meaning the read is a multimapper (aligns to multiple locations). Which set of numbers is used to define multimappers depends on the mapper.
Reference span specifies the genomic region the read is mapped to.
The Cigar string will show deletions or insertions (reference has more or less nucleotides on this position than read) while the colors of the bases in the read show mismatches in the alignment (nucleotide in read differs from nucleotide in reference).
Clipping specifies if any soft clipping (mapper ignores mismatches in the alignment at the ends of the reads) took place.
Hover your mouse over the coverage track and try to interpret the numbers that pop up. When IGV is zoomed to the alignment read visibility threshold (by default, 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T).
There’s a lot of reads for AT1G02930, right click the bam track and squish the reads. In squished mode the gaps in the alignments are visible as blue lines (while the reads are grey). You can clearly see the blue zones that correspond to the introns (thin blue lines) in the annotation track.

Open the sorted SRR074285 bam file
Hover your mouse over the highest peak for AT1G02930 in the Coverage track

Do you think AT1G02930 is differentially expressed?
You can load multiple samples in IGV and compare them. AT1G02930 is differentially expressed but the peaks look the same height in the Coverage track. It's only when you hover your mouse over them you see that coverage is different.

You can solve this by creating a Sashimi plot. It is more informative with respect to different expression and splicing.

Create a Sashimi plot for this region

References:

[1] ttp://bio-bwa.sourceforge.net/

[2] ttp://bowtie-bio.sourceforge.net/bowtie2/index.shtml

[3] ttp://tophat.cbcb.umd.edu/

[4] ttp://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635

[1]

[2]

[3]

[4]

Mapping of NGS data

Contents

Mapping

Exercise 2: Mapping DNASeq reads

Obtaining the reference genome

Mapping the reads

Exercise 3: Sorting and indexing BAM/SAM files

Qualimap BAMQC analysis

Exercise 4: Quality control of DNASeq mapping results

Exercise 5: Quality control of RNASeq mapping results

Qualimap RNASEQC analysis

Exercise 6: Qualimap quality control of RNASeq mapping results

Exercise 7: RSeQC quality control of RNASeq mapping results

Mapping in Galaxy

Mapping RNASeq reads in Galaxy

Mapping DNASeq reads in Galaxy

Download mapping results from Galaxy

Mapping via command line tools

Visualisation of mapping results in IGV

Exercise 8: Viewing the aligned DNA reads in IGV

Exercise 9: Viewing the aligned RNA reads in IGV

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox