Tophat

Tophat aligns RNA-seq data

Tophat is a mapper that aligns RNA-seq data to a reference genome (indexed with bowtie2) and can detect novel isoforms. You can optionally provide your own annotation files, or let Tophat detect isoforms by itself. For other RNA-seq aligners, see http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Aligning_of_RNA-seq_reads

Tophat is the first step of the 'Tuxedo' suite of RNA-seq analysis tools for differential expression. The second step is cufflinks, which assembles the detected isoforms into an annotation file (.gtf). The third step is cuffmerge, which merges the annotations from the different mappings into a single file. The fourth step, cuffdiff, takes this merged annotation file together with the mappings to estimate differential expression between the conditions.

See this figure for an overview of the Tuxedo suite: Tuxedooverview.png
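
The four steps can be sketched as a single shell script. This is only a dry-run outline, not the BITS pipeline itself: it prints the commands and runs them only when DRY_RUN=0 is set. The names genome, genes.gtf, genome.fa, ctrl.fq and treat.fq are hypothetical placeholders, as are the output folder names, and one sample per condition is assumed to keep the example short (a real experiment would use replicates).

```shell
#!/bin/sh
# Dry-run sketch of the Tuxedo steps: tophat -> cufflinks -> cuffmerge -> cuffdiff.
# All input and output names are placeholders, not files from this wiki.
set -eu

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

tuxedo_pipeline() {
    index=$1   # base name of the bowtie2 genome index
    gtf=$2     # known annotation (optional in practice, assumed here)
    fasta=$3   # genome sequence used by cuffmerge

    # Step 1: align each sample separately with tophat.
    run tophat -p 8 -G "$gtf" -o ctrl_thout "$index" ctrl.fq
    run tophat -p 8 -G "$gtf" -o treat_thout "$index" treat.fq

    # Step 2: assemble transcripts per sample (writes <dir>/transcripts.gtf).
    run cufflinks -p 8 -o ctrl_clout ctrl_thout/accepted_hits.bam
    run cufflinks -p 8 -o treat_clout treat_thout/accepted_hits.bam

    # Step 3: merge the per-sample assemblies listed in a manifest file.
    if [ "${DRY_RUN:-1}" = "0" ]; then
        printf '%s\n' ctrl_clout/transcripts.gtf treat_clout/transcripts.gtf > assemblies.txt
    fi
    run cuffmerge -p 8 -g "$gtf" -s "$fasta" -o merged_asm assemblies.txt

    # Step 4: estimate differential expression between the two conditions.
    run cuffdiff -p 8 -o diff_out -L ctrl,treat merged_asm/merged.gtf \
        ctrl_thout/accepted_hits.bam treat_thout/accepted_hits.bam
}

tuxedo_pipeline genome genes.gtf genome.fa
```

Printing the commands first makes it easy to check paths and options before committing cluster time to the real run.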

Command examples

# a typical tophat command

tophat -p 8 --no-coverage-search -o name_thout --transcriptome-index mm9.ensgene <bowtie2_index> reads.fq

In this example, at most 8 threads are allocated to the job, the coverage search is skipped (for speed), and the transcriptome index is set to the Ensembl gene model (other options are refGene or knownGene). <bowtie2_index> is a placeholder for the base name of the bowtie2 genome index, which tophat expects before the reads; finally, the read file is provided as input.

The result is a folder named 'name_thout' containing several files that describe the result of mapping the reads using the ensgene transcriptome.

-rwxrwx---. 1 root vboxsf 809M Oct 24 17:21 accepted_hits.bam
-rwxrwx---. 1 root vboxsf 652K Oct 24 17:14 deletions.bed
-rwxrwx---. 1 root vboxsf 329K Oct 24 17:14 insertions.bed
-rwxrwx---. 1 root vboxsf 8.8M Oct 24 17:14 junctions.bed
drwxrwx---. 1 root vboxsf 4.0K Oct 24 17:14 logs
-rwxrwx---. 1 root vboxsf   70 Oct 24 09:48 prep_reads.info
-rwxrwx---. 1 root vboxsf  17M Oct 24 17:14 unmapped.bam
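
Before moving on, it can help to confirm that a run produced the files shown in the listing above. The function below is a minimal sketch; check_thout is a hypothetical helper name, and the list of files is taken from the listing rather than from the tophat documentation.

```shell
# Check a tophat output folder for the files shown in the listing above.
# Returns 0 when all are present, 1 otherwise.
check_thout() {
    dir=$1
    missing=0
    for f in accepted_hits.bam junctions.bed insertions.bed deletions.bed unmapped.bam; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the example above, it would be called as: check_thout name_thout && echo "name_thout looks complete"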

Each group of reads is processed with tophat separately. This can take quite some time (up to ~10 h on the BITS cluster for 15 million reads).
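
Processing each group separately amounts to one tophat run per read file, each with its own output folder. The loop below is a dry-run sketch under assumed names: genome stands for the bowtie2 index base name, the .fq file names are made up, and map_samples is a hypothetical helper. Commands are printed, and executed only when DRY_RUN=0 is set.

```shell
# Dry-run sketch: one tophat run per read file, each in its own output folder.
# The index name and read file names are placeholders.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

map_samples() {
    index=$1
    shift
    for fq in "$@"; do
        sample=$(basename "$fq" .fq)   # ctrl_1.fq -> ctrl_1
        run tophat -p 8 --no-coverage-search -o "${sample}_thout" "$index" "$fq"
    done
}

map_samples genome ctrl_1.fq ctrl_2.fq treat_1.fq treat_2.fq
```

Naming each output folder after its sample keeps the accepted_hits.bam files from overwriting one another.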

The 'accepted_hits.bam' file is the input for the next processing step, using cufflinks or other RNA-seq quantification software (e.g. DESeq).