Parameters of Picard

From BITS wiki

Picard is a set of command-line tools for handling SAM and BAM files.

SortSam

Sorts and indexes BAM and SAM files.

  • Input: the BAM file to sort.
  • Output: the sorted BAM file. The tool automatically produces a .bai file containing the index of the BAM file.
  • SortOrder: sort order of the output file. Possible values: unsorted, queryname, coordinate, duplicate. Most applications expect reads sorted by coordinate.
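
For readers who run Picard outside GenePattern, the parameters above map roughly onto a SortSam command line. The sketch below only assembles that command; it assumes Picard is started from its jar with the classic KEY=value argument style. The jar path, the file names and the helper name build_sortsam_command are placeholders, and whether the GenePattern module sets CREATE_INDEX internally is an assumption.

```python
import shlex

# Sort orders listed for the SortOrder parameter.
VALID_SORT_ORDERS = ("unsorted", "queryname", "coordinate", "duplicate")


def build_sortsam_command(input_bam, output_bam, sort_order="coordinate",
                          picard_jar="picard.jar"):
    """Return the argument list for a Picard SortSam run.

    picard_jar is a hypothetical path; adjust it to your installation.
    """
    if sort_order not in VALID_SORT_ORDERS:
        raise ValueError(f"unknown sort order: {sort_order!r}")
    cmd = [
        "java", "-jar", picard_jar, "SortSam",
        f"I={input_bam}",
        f"O={output_bam}",
        f"SORT_ORDER={sort_order}",
    ]
    if sort_order == "coordinate":
        # A BAM index needs coordinate order, so only request it in that case.
        cmd.append("CREATE_INDEX=true")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_sortsam_command("sample.bam", "sample.sorted.bam")))
```

Printing the command rather than executing it keeps the sketch independent of a local Picard installation; the list can be passed to subprocess.run once the paths are correct.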


MarkDuplicates

The tool identifies duplicates based on their sequence (PCR duplicates) and on their location on the Illumina flow cell (optical duplicates). In RNA-Seq analysis, removing duplicate reads is not advised unless you keep track of their counts. Therefore, limit the duplicate analysis to marking the duplicate reads, so that later steps in the workflow can recognize them as duplicates.
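
Marking, as opposed to removing, means the reads stay in the file and only their FLAG field changes: the SAM specification reserves bit 0x400 (decimal 1024) for "PCR or optical duplicate". The following minimal sketch shows how a downstream step could test that bit on plain SAM text lines. The helper names and the example records in the usage note are made up for illustration.

```python
# Bit 0x400 (decimal 1024) of the SAM FLAG field marks a
# PCR or optical duplicate, as defined in the SAM specification.
DUPLICATE_FLAG = 0x400


def is_duplicate(flag):
    """Return True if the SAM FLAG value has the duplicate bit set."""
    return bool(flag & DUPLICATE_FLAG)


def count_duplicates(sam_lines):
    """Count marked duplicates among SAM alignment lines."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines carry no FLAG field
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if is_duplicate(flag):
            total += 1
    return total
```

For example, a read with FLAG 99 is not a duplicate, while the same read with FLAG 1123 (99 + 1024) is; both records remain available to later steps, which can count or skip the marked one.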

  • Input: one or more files; reads can be mapped or unmapped.
  • Read Name Regex: regular expression used to parse read names in the input file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication, which gives a more accurate estimate of the library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-Seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
  • Optical Duplicate Distance: maximum distance in pixels between two duplicate clusters for them to be considered optical duplicates. The default is appropriate for unpatterned Illumina flow cells. For patterned flow cell models (HiSeq X), 2500 is more appropriate. Default value: 100.
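
To illustrate how the last two parameters work together, the sketch below parses tile, x and y from a read name and then checks whether two reads lie within the optical duplicate distance. It is a simplified illustration, not Picard's implementation: the regular expression is a hypothetical pattern for read names of the form instrument:run:flowcell:lane:tile:x:y, the helper names are made up, and the check assumes both reads are already known duplicates from the same lane.

```python
import re

# Hypothetical pattern for Illumina-style read names of the form
# instrument:run:flowcell:lane:tile:x:y. The three capture groups
# are tile, x coordinate and y coordinate.
READ_NAME_PATTERN = re.compile(r"(?:[^:]+:){4}(\d+):(\d+):(\d+)")


def parse_read_name(name):
    """Return (tile, x, y) as integers, or None if the name does not match."""
    match = READ_NAME_PATTERN.match(name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


def looks_optical(name_a, name_b, max_distance=100):
    """Rough optical-duplicate test for two reads already known to be duplicates.

    Assumes both reads come from the same lane; the lane is not compared here.
    """
    a = parse_read_name(name_a)
    b = parse_read_name(name_b)
    if a is None or b is None:
        return False  # without coordinates no optical test is possible
    same_tile = a[0] == b[0]
    close_x = abs(a[1] - b[1]) <= max_distance
    close_y = abs(a[2] - b[2]) <= max_distance
    return same_tile and close_x and close_y
```

With the default distance of 100, two clusters on the same tile that are 500 pixels apart in x are not treated as optical duplicates, whereas calling looks_optical with max_distance=2500 (the value suggested for patterned flow cells) would accept them.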