Peak calling with MACS

From BITS wiki
Jump to: navigation, search

Using the mapping results we can define the peaks, the regions with a high density of reads in the ChIP sample, where the transcription factor was bound.

There are multiple programs to perform the peak calling. Some are more directed towards histone marks (broad peaks) while others are specific to narrow peaks (transcription factors). Here we will use MACS because it's known to produce generally good results, and it is well-maintained by the developer.

MACS is installed on GenePattern. Check the documentation on GenePattern or read the manual on the MACS github.

Let's see the parameters of MACS before launching the peak calling.

Note that the bam files need to be sorted according to genomic location. At this point they are not, the reads are in the same order as they were in the fastq file, according to position on the flow cell.

Let's go over the different parameters of MACS:

The effective genome size is the size of the genome considered "usable" for peak calling. This value is given by the MACS developers on their website. It is smaller than the complete genome because many regions are excluded (telomeres, highly repeated regions...). The default value is for human (2700000000), so we need to change it. As the value for E. coli is not provided, we will take the complete genome size 4639675.

MACS needs the length of the fragments, which are longer than the read length, because the sequencer sequences only parts starting from the end of the fragments. MACS2 does this by making a model of enrichment of reads in the ChIP sample versus the background, searching pairs of peaks within a bandwidth of 300 bases with an enrichment ratio between 5 and 50. If there are not enough pairs of peaks, as is the case in our data, you can fall back on using a preset fragment length by setting the model parameter to no. The default of shift 0 extsize 200 is adequate for ChIPSeq. It means that reads are extended to a length of 200 bases before they are counted.

The duplicates specifies how MACS should treat the reads that are mapped to the exact same location (duplicates). The manual specifies that keeping only 1 representative of these "stacks" of reads is giving the best results.

The make BedGraph parameter will output a file in BEDGRAPH format to visualize the peak profiles in a genome browser. There will be one file for the treatment, and one for the control.

FDR and FDR for broad peaks indicates that MACS will report peaks if their associated p-value is lower than the value specified here. Use a relaxed threshold as you want to keep a high number of peaks (even if some of them are false positives).

Look at the files that were created by MACS.