Parameters of TopHat
From BITS wiki
Basic parameters
- Bowtie index: TopHat uses Bowtie, a short-read mapper for DNA reads, as its alignment engine, so both mappers use the same index files. Select one of the built-in indexed genomes, or upload your own genome and index it first with the Bowtie.indexer tool.
- GTF file: file containing the annotation of the genome. Select one of the built-in annotations or upload your own GTF file. When you provide a GTF file, TopHat first extracts the transcript sequences and aligns the reads to this virtual transcriptome. Only the reads that do not fully map to the transcriptome are then mapped to the genome. The reads that did map to the transcriptome are converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final TopHat output.
- reads pair 1 and reads pair 2: input files in fastq format. For single-end sequencing you only provide reads pair 1. For paired-end sequencing you provide both files.
- library type: specify if you performed standard unstranded RNA-Seq or strand-specific RNA-Seq. Most people perform standard RNA-Seq.
- mate inner dist: the expected (mean) inner distance between the two reads of a pair when mapped to the transcriptome (so introns are not taken into account). For example, for paired-end runs with fragments selected at 300 bp on average, where each read is 50 bp, you should set this parameter to 200 (average fragment length - 2 * read length). The default is 50.
- mate std dev: the expected standard deviation of the distribution of inner distances between the two reads of a pair. For example, for paired-end runs with fragments selected from 100 bp to 500 bp, you should set this parameter to 100, i.e. (max fragment length - min fragment length) / 4. The default is 20.
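The two rules of thumb for mate inner dist and mate std dev can be written as a short Python sketch. The function names are illustrative only and are not part of TopHat or GenePattern; the formulas are the ones given in the list above.

```python
def mate_inner_dist(mean_fragment_length, read_length):
    """Mean inner distance: average fragment length minus both reads."""
    return mean_fragment_length - 2 * read_length


def mate_std_dev(min_fragment_length, max_fragment_length):
    """Rough standard deviation: a quarter of the selected size range."""
    return (max_fragment_length - min_fragment_length) / 4


# Examples from the text: 300 bp fragments with 50 bp reads,
# and a size selection from 100 bp to 500 bp.
print(mate_inner_dist(300, 50))  # 200
print(mate_std_dev(100, 500))    # 100.0
```

If the result is not a whole number, round it before entering it in the form.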
Parameters Relevant when Transcriptome Search is Activated
As explained above, when you provide a GTF file, TopHat first maps the reads against the transcriptome and then maps the reads that could not be mapped in this first step to the genome.
- transcriptome only: when you set this parameter to yes, TopHat will skip the second step and will only align the reads to the transcriptome. It will report those mappings as genomic mappings. Default is no.
- max transcriptome hits: maximum number of mappings allowed for a read, when aligned to the transcriptome. Any reads found with more than this number of mappings will be discarded.
- prefilter multihits: when mapping reads on the transcriptome, some repetitive or low complexity reads that would be discarded when mapped to the genome may appear to align to the transcriptome and thus may end up in the output file. This option directs TopHat to first map the reads to the genome in order to determine and exclude such reads.
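For readers who run TopHat outside GenePattern, the sketch below shows how these settings might be passed on the command line. It assumes that the fields above correspond to the TopHat 2 options `--GTF`, `--transcriptome-only`, `--transcriptome-max-hits` and `--prefilter-multihits`. The function name and all file names are hypothetical, and the sketch only builds the argument list without running anything.

```python
def build_tophat_command(bowtie_index, reads_1, reads_2=None, gtf=None,
                         transcriptome_only=False,
                         transcriptome_max_hits=None,
                         prefilter_multihits=False):
    """Build a TopHat argument list (hypothetical helper, nothing is run)."""
    cmd = ["tophat"]
    # Following the section above, the transcriptome options are only
    # added when an annotation file is supplied.
    if gtf is not None:
        cmd += ["--GTF", gtf]
        if transcriptome_only:
            cmd.append("--transcriptome-only")
        if transcriptome_max_hits is not None:
            cmd += ["--transcriptome-max-hits", str(transcriptome_max_hits)]
        if prefilter_multihits:
            cmd.append("--prefilter-multihits")
    # Positional arguments: index base name, then one or two read files.
    cmd.append(bowtie_index)
    cmd.append(reads_1)
    if reads_2 is not None:
        cmd.append(reads_2)
    return cmd


# Placeholder file names, paired-end run with transcriptome search.
print(" ".join(build_tophat_command(
    "genome_index", "reads_1.fastq", "reads_2.fastq",
    gtf="genes.gtf", prefilter_multihits=True)))
```

Returning a list rather than a single string keeps the arguments separate, so the result could be handed to `subprocess.run` without shell quoting issues.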
Advanced Parameters with Values Tuned to Mammalian Genomes
- read mismatches: the total number of mismatches (gaps not included) in each read alignment is counted. When this number exceeds the threshold, the alignment is discarded. The default is 2.
- read gap length: the total length of the gaps in each read alignment is counted. When this number exceeds the threshold, the alignment is discarded. The default is 2.
- read edit dist: the total number of edits (mismatches + gaps) in each read alignment is counted. When this number exceeds the threshold, the alignment is discarded. The default is 2.
- min anchor length: TopHat reports a junction only if it is spanned by reads with at least this many bases on each side of the junction. When multiple reads span the same junction, at least one of them must have this many bases on each side. The value must be at least 3; the default is 8.
- min intron length: minimum intron length. TopHat ignores splice sites (donor/acceptor pairs) that are closer than this many bases apart. The default is 70, which is fine for mammalian genomes but not necessarily for other organisms such as plants or yeast. For instance, Arabidopsis has small introns, so this parameter should be set to 25. See this article for an overview of intron sizes in various organisms.
- max intron length: maximum intron length. TopHat ignores splice sites (donor/acceptor pairs) that are more than this many bases apart. The default is 500000, which is fine for mammalian genomes but not necessarily for other organisms such as plants or yeast. For instance, Arabidopsis has small introns, so this parameter should be set to 3000. See this article for an overview of intron sizes in various organisms.
- max insertion length: maximum length of a gap in the alignment (gap in the reference sequence). The default is 3.
- max deletion length: maximum length of a gap in the alignment (gap in the read). The default is 3.
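The thresholds above can be read as simple pass/fail checks. The Python sketch below illustrates that logic with the default values from this section; it is not TopHat's own code, and the `Alignment` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    mismatches: int  # number of mismatched bases
    gap_length: int  # total length of insertions and deletions


def passes_filters(aln, read_mismatches=2, read_gap_length=2,
                   read_edit_dist=2):
    """Keep an alignment only if it stays within all three thresholds."""
    edits = aln.mismatches + aln.gap_length
    return (aln.mismatches <= read_mismatches
            and aln.gap_length <= read_gap_length
            and edits <= read_edit_dist)


def intron_allowed(intron_length, min_intron_length=70,
                   max_intron_length=500000):
    """True if the intron length lies within the configured range."""
    return min_intron_length <= intron_length <= max_intron_length


print(passes_filters(Alignment(mismatches=2, gap_length=0)))  # True
print(passes_filters(Alignment(mismatches=1, gap_length=2)))  # False
print(intron_allowed(5000))                                   # True
print(intron_allowed(5000, min_intron_length=25,
                     max_intron_length=3000))                 # False
```

The second alignment fails because 1 mismatch plus a gap length of 2 gives 3 edits, above the default read edit dist of 2, even though each individual count is within its own limit. The last line shows a 5000 bp intron being rejected under the Arabidopsis settings suggested above.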