Parameters of STAR

STAR maps reads to a reference genome sequence.
To obtain a description of its parameters and their default values, click the Documentation link at the top of the page.


A bit more explanation about some of the parameters:

mapping and reporting of mapped reads

  • max number mismatches: maximum number of mismatches allowed for a read (single-end) or for a pair of reads (paired-end). Default is 10. The appropriate value depends on the read length. For short, quality-trimmed reads you typically allow 5% mismatches.
  • mates max gap: maximum distance between the two reads of a pair when mapped to the genome. If the reads map farther apart than this, the fragment is considered to be chimeric. The default value of 500000 is tuned to mammalian genomes; for plant and yeast genomes you will have to decrease it. In Tophat the comparable setting corresponds roughly to the max fragment length, because Tophat maps the reads to the transcriptome and introns are not taken into consideration when calculating the mates gap. STAR maps the reads to the genome, so the two reads of a pair can be separated by an intron, and the max distance between them is set by the intron size. For organisms with small introns you should take intron size + max fragment length.
  • max multimapping: maximum number of locations on the genome that a read may map to. Reads that map to more locations than this are excluded from the results. Default is 10. Multimapping reads are common when you map short reads. What to do with multimappers is a complicated issue. You could use them to represent the expression of whole classes or families of RNA (repeats such as transposons, gene families, etc.). It is useful to have two separate files: one for unique mappers and one for multimappers.
  • min report canonical junction overhang: if you specify a GTF file, STAR will extract splice junctions from this file and use them to greatly improve the accuracy of the mapping. STAR not only maps reads using annotated splice junctions, it can also detect novel splice sites based on the sequence characteristics of annotated junctions. For instance, the major spliceosome splices introns containing GT at the 5' splice site and AG at the 3' splice site. This type of splicing is called canonical splicing and accounts for more than 99% of splicing. By contrast, when the intronic flanking sequences do not follow the GT-AG rule, noncanonical splicing is said to occur.
  • map only reported junctions: the prediction of novel splice junctions by STAR is quite complicated. In short: if you set this parameter to yes, STAR will only report high-quality mappings with respect to splice junctions: unspliced reads, reads that map to annotated splice sites and reads that map to predicted splice sites with high confidence.
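
The two rules of thumb given above (5% mismatches for short trimmed reads, and intron size + max fragment length for the mates max gap in organisms with small introns) are simple arithmetic. The sketch below only illustrates that arithmetic. The function names and the example values are hypothetical, and applying the 5% to the combined length of both mates for paired-end data is an assumption based on the parameter being defined per pair.

```python
def max_mismatches(read_length, mates=1, percent=5):
    """Mismatch budget as a percentage of the sequenced bases.

    mates=1 for single-end reads, mates=2 for a pair of reads.
    Integer arithmetic is used, so the result is rounded down.
    """
    return read_length * mates * percent // 100


def mates_max_gap(max_intron_size, max_fragment_length):
    """Rule of thumb for organisms with small introns."""
    return max_intron_size + max_fragment_length


# Hypothetical example values, not recommendations for any organism.
print(max_mismatches(50))            # single-end, 50 bp reads
print(max_mismatches(100, mates=2))  # paired-end, 2 x 100 bp reads
print(mates_max_gap(2000, 500))      # assumed intron and fragment sizes
```

The results are starting points to enter in the corresponding parameter fields, not values STAR computes for you.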

postprocessing and supplementary output

  • two-pass: for the most sensitive novel junction discovery, you should run STAR in 2-pass mode. This does not increase the number of novel junctions, but it allows STAR to detect more reads that map to the novel junctions. The basic idea is to run STAR with the standard parameters, collect the junctions detected in this first pass, and use them as annotated junctions for the second-pass mapping.
  • output unmapped reads: by default STAR does not save the unmapped reads, so if you want to analyze them (e.g. with BLAST) you need to change this setting.
  • detect chimeric reads: chimeric reads occur when one sequencing read aligns to two distinct portions of the genome. In RNA-seq, chimeric reads may indicate the presence of chimeric genes. Chimeric genes form through the combination of portions of one or more coding sequences. Many chimeric genes form through errors in DNA replication or DNA repair, so that pieces of two different genes are combined. Chimeric genes can also form when a retrotransposon accidentally copies the transcript of a gene and inserts it into the genome in a new location. Depending on where the new retrogene appears, it can produce a chimeric gene...
  • quantify genes: map reads and create a count table (table with counts of how many reads map to each gene).
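
For readers who run STAR from the command line rather than through GenePattern, the sketch below shows how the options in this section might translate into STAR flags. The mapping from the GenePattern parameter names to the flags `--twopassMode`, `--outReadsUnmapped`, `--chimSegmentMin` and `--quantMode` is an assumption based on the flag names in the STAR manual, not something this page states. The function name, file names and the chimeric segment length are hypothetical.

```python
def build_star_command(genome_dir, read_files, two_pass=False,
                       output_unmapped=False, chimeric_segment_min=0,
                       quantify_genes=False):
    """Assemble a STAR argument list (the command is not executed here)."""
    cmd = ["STAR", "--genomeDir", genome_dir, "--readFilesIn", *read_files]
    if two_pass:
        # Assumed equivalent of "two-pass": junctions found in the first
        # pass are reused as annotated junctions in the second pass.
        cmd += ["--twopassMode", "Basic"]
    if output_unmapped:
        # Assumed equivalent of "output unmapped reads".
        cmd += ["--outReadsUnmapped", "Fastx"]
    if chimeric_segment_min > 0:
        # Assumed equivalent of "detect chimeric reads"; 0 leaves it off.
        cmd += ["--chimSegmentMin", str(chimeric_segment_min)]
    if quantify_genes:
        # Assumed equivalent of "quantify genes"; needs gene annotation.
        cmd += ["--quantMode", "GeneCounts"]
    return cmd


# Hypothetical index directory and FASTQ file names.
print(" ".join(build_star_command(
    "star_index", ["sample_R1.fastq", "sample_R2.fastq"],
    two_pass=True, output_unmapped=True, quantify_genes=True)))
```

Building the command as a list keeps each flag and its value together and lets you pass the result directly to `subprocess.run` once you have checked the flags against the documentation of your STAR version.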