Parameters of Trimmomatic

From BITS wiki

Description of parameters and their default values:

Basic input parameters

  • Input files: why are there two input files? If you have paired-end reads, you can submit both files of a pair simultaneously. If you have single-end reads, you submit just one input file.
  • phred encoding: the encoding of the input file(s). The encoding matters because it determines the offset used to store the quality scores as characters (ASCII offset 33 or ASCII offset 64). The default is 33. If you're not sure, you can check the encoding of your file in the FastQC report (bear in mind that FastQC occasionally guesses wrong).
    GP9.png
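To make the role of the offset concrete, here is a minimal Python sketch that decodes a FASTQ quality string. The function names and the cut-off characters used in `guess_offset` are illustrative assumptions, not part of Trimmomatic or FastQC, and the guess is deliberately crude.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Crude heuristic (an assumption, not the FastQC algorithm).

    Characters below ASCII 59 cannot occur with offset 64, and
    characters above ASCII 74 are unusual for offset 33 Illumina data.
    Returns None when the string is compatible with both encodings.
    """
    codes = [ord(ch) for ch in quality_string]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None


print(decode_qualities("II5#"))             # [40, 40, 20, 2]
print(decode_qualities("hhTB", offset=64))  # [40, 40, 20, 2]
print(guess_offset("II5#"))                 # 33
```

The same scores look very different as text under the two encodings, which is why picking the wrong offset shifts every quality value by 31.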

Adapter clipping parameters: settings for trimming of adapter sequences from the reads

  • adapter clip sequence file: GenePattern has a number of built-in adapter sequences. You can select one of these by clicking the blue text area:
    GP20.png

    If you want to trim a sequence that is not on this list you can submit a fasta file containing the sequences you want to remove. See an example of such a fasta file. Sequences and names of contaminating adapters can be found in the Overrepresented sequences module of the FastQC report.


    GP21.png

Trimmomatic aligns adapter sequences to your reads using a seed-and-extend approach, much as BLAST does:

  • 1. The adapter is cut into overlapping pieces of 16 bp, which are aligned to the reads. If this short alignment, known as the seed, is a perfect or almost perfect match, the entire alignment between the read and the full adapter is scored.
  • 2. Each perfectly matching base adds 0.6 to the score, so a score of 15 requires a perfect 25 base match. Each mismatching base reduces the score by Q/10, where Q is the quality score of that base. A mismatch at a high-quality base (Q40) therefore costs 4 points, and it takes about 7 additional matching bases to make up for it.
  • 3. If the score exceeds a certain threshold, the adapter is clipped.
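The scoring rule in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, assuming an ungapped position-by-position comparison; it is not Trimmomatic's actual implementation, and the function names and the test sequence are made up for the example.

```python
import math
from fractions import Fraction

MATCH_SCORE = Fraction(6, 10)  # each matching base adds 0.6


def alignment_score(adapter, read_segment, qualities):
    """Score an ungapped alignment: +0.6 per match, -Q/10 per mismatch.

    qualities holds the integer Phred score of each read base.
    """
    score = Fraction(0)
    for a, r, q in zip(adapter, read_segment, qualities):
        if a == r:
            score += MATCH_SCORE
        else:
            score -= Fraction(q, 10)
    return float(score)


def min_perfect_matches(threshold):
    """Smallest number of perfect matches whose score reaches an integer threshold."""
    return math.ceil(Fraction(threshold) / MATCH_SCORE)


adapter = "ACGTA" * 5                  # arbitrary 25-base sequence
read_with_error = "T" + adapter[1:]    # one mismatch at the first base
print(alignment_score(adapter, adapter, [40] * 25))          # 15.0
print(alignment_score(adapter, read_with_error, [40] * 25))  # 10.4
print(min_perfect_matches(7), min_perfect_matches(15))       # 12 25
```

Exact fractions are used only to keep the arithmetic free of floating-point rounding; the rule itself is plain addition and subtraction.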

For adapter clipping you need to set the following parameters:

  • adapter clip seed mismatches: the maximum number of mismatches that you allow in the seeds. Recommended value is 2. You have to specify this parameter if you want the clipping to succeed.
  • adapter clip simple clip threshold: the minimum score of the full alignment between adapter and read for the clipping to take place. Values between 7 and 15 are recommended, but the best value depends on the length of the reads. These thresholds correspond to 12 and 25 perfect matches respectively, so with short reads a high threshold will still remove adapter dimers but will leave shorter stretches of adapter contamination undetected. You need to specify this parameter for the clipping to proceed.

For paired-end data the basic clipping strategy is adjusted. If one read of a pair contains adapter contamination, the other read will have the same problem, because both reads originate from the same fragment and that fragment is shorter than the read length. Both reads will therefore have adapter contamination at the same position, and the two non-adapter parts of the reads will be reverse complements of each other. Trimmomatic uses this extra information to detect even small pieces of contaminating adapter in paired-end data. This approach is called palindrome mode (see slides).

  • adapter clip palindrome clip threshold: the minimum score of the full alignment between two adapter-contaminated reads in palindrome mode. Values of around 30 or more are recommended (corresponding to 50 or more matching bases). Even if you have single-end data, you need to specify this parameter for the clipping to proceed.
  • adapter clip min length: the minimum length of a detected adapter in palindrome mode. The default is 8 bases. Since palindrome mode has a very low false positive rate, this can be safely reduced, even down to 1, to allow shorter adapter fragments to be removed. Even if you have single-end data, you need to specify this parameter for the clipping to proceed.
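The idea behind palindrome mode can be illustrated with a toy Python example. This is a simplified sketch that assumes error-free reads and exact matching; Trimmomatic itself scores the alignment against the thresholds above rather than demanding identity. The sequences, the `min_overlap` parameter and the function names are invented for the illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_short_fragment(read1, read2, min_overlap=10):
    """Return the fragment length k if the first k bases of read1 are the
    reverse complement of the first k bases of read2, otherwise None.

    Only k shorter than the reads is tested, because only then does
    adapter sequence follow the fragment.
    """
    longest = min(len(read1), len(read2)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read1[:k] == reverse_complement(read2[:k]):
            return k
    return None


# A 20-base fragment sequenced with 25-base reads: each read runs
# 5 bases into the adapter (hypothetical adapter sequence).
fragment = "ACGTTGCAAGGCTAGCTTAC"
adapter = "AGATCGGAAG"
read1 = fragment + adapter[:5]
read2 = reverse_complement(fragment) + adapter[:5]

k = find_short_fragment(read1, read2)
print(k)           # 20
print(read1[:k])   # ACGTTGCAAGGCTAGCTTAC
```

Only 5 bases of adapter are present in each read here, far too few to reach a simple clip threshold of 7, yet the reverse-complement relationship between the two reads reveals where the fragment ends.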

Trim leading parameter: settings for trimming of low quality bases from the 5' ends of the reads

Trim trailing parameter: settings for trimming of low quality bases from the 3' ends of the reads
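As a rough illustration of what leading and trailing trimming do, the Python sketch below removes bases from either end of a read for as long as their quality is below a threshold. The function names and the example threshold of 3 are assumptions made for this example.

```python
def trim_leading(seq, quals, threshold):
    """Drop bases from the 5' end while their quality is below threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return seq[start:], quals[start:]


def trim_trailing(seq, quals, threshold):
    """Drop bases from the 3' end while their quality is below threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


seq = "ACGTACGT"
quals = [2, 2, 30, 30, 30, 30, 10, 2]
print(trim_leading(seq, quals, 3))   # ('GTACGT', [30, 30, 30, 30, 10, 2])
print(trim_trailing(seq, quals, 3))  # ('ACGTACG', [2, 2, 30, 30, 30, 30, 10])
```

Trimming stops at the first base that meets the threshold, so low-quality bases in the middle of a read are left untouched by these two steps.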

Sliding window parameters: settings for trimming low-quality stretches from the reads

This step scans each read and trims it once the average quality within the window falls below the threshold. Because the window considers multiple bases at once, a single poor-quality base will not cause removal of the surrounding high-quality bases.

  • sliding window size: controls the size of the window (in bases)
  • sliding window quality threshold: the minimum average quality (as a phred value) that a window must have for it and the subsequent bases to be retained.

Both parameters must be specified: typical examples use a 4-base wide window, cutting when average quality drops below 15.
Note that you can use these parameters to filter out low-quality reads by setting the window size equal to the length of the reads. If you're not sure about the length of the reads, you can find this information in the Basic Statistics module of FastQC.


GP22.png
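The sliding window step described above can be sketched in Python as follows. This is a simplified version that cuts the read at the start of the first window whose average quality is too low; the exact cut position chosen by Trimmomatic may differ, and returning reads shorter than the window unchanged is an assumption of this sketch.

```python
def sliding_window_trim(seq, quals, window_size=4, min_avg_quality=15):
    """Cut the read at the first window whose mean quality is below the threshold.

    Reads shorter than the window are returned unchanged (assumption).
    """
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            return seq[:start], quals[:start]
    return seq, quals


# The quality drops towards the 3' end: the read is cut after 6 bases.
print(sliding_window_trim("ACGTACGTAC", [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]))
# ('ACGTAC', [30, 30, 30, 30, 30, 30])

# A single poor base is tolerated because the window average stays high.
print(sliding_window_trim("ACGTAC", [30, 30, 2, 30, 30, 30]))
# ('ACGTAC', [30, 30, 2, 30, 30, 30])

# Window size equal to the read length acts as a whole-read filter.
print(sliding_window_trim("ACGT", [10, 10, 10, 10], window_size=4))
# ('', [])
```

In the last call the whole read is the only window, so the read is either kept entirely or emptied, which is the filtering trick mentioned above.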

Min Read Length parameters: minimum length of reads after clipping
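For readers who later run Trimmomatic outside GenePattern, the parameters on this page correspond to the trimming steps given on the Trimmomatic command line (ILLUMINACLIP, LEADING, TRAILING, SLIDINGWINDOW, MINLEN). The Python helper below is a hypothetical convenience function that only assembles those step strings; the mapping from the GenePattern parameter names to the steps is an assumption based on their descriptions, and the file name and values in the usage example are placeholders.

```python
def build_trimmomatic_steps(adapter_fasta=None, seed_mismatches=2,
                            palindrome_clip_threshold=30,
                            simple_clip_threshold=10,
                            min_adapter_length=8,
                            leading=None, trailing=None,
                            window_size=None, window_quality=None,
                            min_length=None):
    """Assemble Trimmomatic step strings in the order they are applied."""
    steps = []
    if adapter_fasta:
        # ILLUMINACLIP:<fasta>:<seed mismatches>:<palindrome clip threshold>
        #             :<simple clip threshold>:<min adapter length>
        steps.append(
            f"ILLUMINACLIP:{adapter_fasta}:{seed_mismatches}:"
            f"{palindrome_clip_threshold}:{simple_clip_threshold}:"
            f"{min_adapter_length}"
        )
    if leading is not None:
        steps.append(f"LEADING:{leading}")
    if trailing is not None:
        steps.append(f"TRAILING:{trailing}")
    if window_size is not None and window_quality is not None:
        steps.append(f"SLIDINGWINDOW:{window_size}:{window_quality}")
    if min_length is not None:
        steps.append(f"MINLEN:{min_length}")
    return steps


# Placeholder file name and values, for illustration only.
for step in build_trimmomatic_steps("adapters.fa", leading=3, trailing=3,
                                    window_size=4, window_quality=15,
                                    min_length=36):
    print(step)
```

Steps are applied in the order given, so the minimum length check is placed last to act on the reads after all clipping and trimming.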