ChIP-Seq CLI tutorial


The content of this page was adapted from the tutorial material provided with PeakAnalyzer.

Introduction

  • Install PeakAnalyzer on your computer by following the instructions provided on the website.
  • Create a tutorial folder somewhere on your machine, named PeakAnalyzer_tutorial.
  • Download the tutorial data from Here and unarchive it to a folder called data in your tutorial folder.
  • PeakAnalyzer comes with human and mouse GTF annotation files. Make sure that these GTF files are present in the Data folder within the PeakAnalyzer folder. You can download files for other organisms of interest from [1].


[Figure: ensembl_gtf-ftp.png]
  • You are now ready to proceed with the tutorial as provided by the PeakAnalyzer developers.

Starting the GUI

For this very basic tutorial, we will use the graphical interface. Direct CLI usage is also possible; please refer to the instructions below for detailed command syntax.

  • GUI: launch the program by double-clicking on the "PeakAnalyzerGUI.jar" icon.
  • CLI: open a terminal window, navigate to the PeakAnalyzer folder and type:
java -Xmx2G -jar PeakAnalyzerGUI.jar

(This sets the maximum Java heap size to 2 GB; adapt the value to your available memory.)

• PeakAnalyzer Overview

////////////////////////////////////////////////////////////////////////////////
INSTALLATION
////////////////////////////////////////////////////////////////////////////////

Unpack the PeakAnalyzer.zip file, and move to the PeakAnalyzer folder. This folder
contains two jar files:
1. PeakAnalyzer.jar
2. PeakAnalyzerGui.jar
PeakAnalyzer.jar is the executable jar file you have to run. The directory also
includes a "Data" directory which contains annotation files for human and mouse.

////////////////////////////////////////////////////////////////////////////////
REQUIREMENTS
////////////////////////////////////////////////////////////////////////////////

1. Java 1.5 or later installed. If running PeakAnalyzer under Windows, Java 1.6_10
or later is recommended.
2. A "Data" folder MUST be present in the folder where you run PeakAnalyzer.
3. You have to be connected to the internet in order to retrieve subpeak sequences.
4. R is required in order to generate result plots.

You can download R from http://www.r-project.org/.
The R bin directory should be included in the PATH.

On Linux/Unix, using bash, you need to add the R/bin directory to the PATH by
adding this line to your ~/.bashrc:

export PATH="<location of R installation directory>/R-X.X.X/bin:$PATH"

If you use the C shell, add the following to your .cshrc file:
setenv PATH <location of R installation directory>/R-X.X.X/bin:${PATH}

On Windows, open the System Properties dialog.
In the Advanced section, click the Environment Variables button.
Then, in the Environment Variables window, highlight the Path variable in the
System variables section and click the Edit button.
Add the path to the R/bin directory, or modify the existing entry.
For example: "C:\Program Files\R\R-X.X.X\bin"
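
Whether the PATH change took effect can be checked before launching PeakAnalyzer.
The following is only a sketch and is not part of PeakAnalyzer; it assumes the R
executable is simply named "R" and uses the Python standard library to look it up
on the PATH. The function name find_r_executable is hypothetical.

```python
import shutil

def find_r_executable():
    """Return the full path of the R executable found on PATH, or None."""
    # shutil.which searches the directories listed in PATH; on Windows it
    # also tries the extensions listed in PATHEXT (for example R.exe).
    return shutil.which("R")

r_path = find_r_executable()
if r_path is None:
    print("R was not found on PATH; result plots cannot be generated.")
else:
    print("R found at", r_path)
```

If the first message is printed, re-open the terminal (or log in again) so that
the modified PATH is picked up, then repeat the check.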

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

To launch the program, double click on "PeakAnalyzer.jar", or open a terminal
window, navigate to the PeakAnalyzer folder and type:
java -jar PeakAnalyzer.jar

////////////////////////////////////////////////////////////////////////////////
DOCUMENTATION
////////////////////////////////////////////////////////////////////////////////

PeakAnalyzer is a Java GUI application comprising two main utilities:
1. PeakAnnotator - for annotating genomic loci
2. PeakSplitter - for subdividing broad peaks into individual binding sites

The following documentation describes the parameters of each GUI window.

First window - "Choose Utility"
==================================
In this window you have to choose which application to run. The options are:
1. Peak Annotation - choose this option if you want to perform bulk annotation of
genomic locations, such as identifying the location within genes or the closest
up- or downstream transcription start site.
2. Split Peaks - choose this option if you want to split enrichment areas into
individual binding sites. This should be done prior to de novo motif analysis.

Peak Annotation window
=========================
In this window you can choose which annotation utility to run. The options are:
1. NDG - For each locus, search for its Nearest Downstream Gene on both the
forward and reverse strand. If the position of the locus is within a gene,
the program describes in which part of that gene the locus is situated (for
example exon, first intron, etc.).
2. TSS - For each locus, find its closest TSS (transcription start site). To
do this, the program searches for the closest gene, either upstream or
downstream of the genomic coordinate of the locus.
3. ODS - Overlap two data sets (peak files) to identify common and
unique genomic locations. Random regions matched for chromosome and
length are used to calculate an enrichment over random and a p-value.
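
The idea behind the TSS option can be sketched as follows. This is only an
illustration under stated assumptions, not PeakAnalyzer's implementation: genes
are hypothetical (gene_id, start, end, strand) tuples on the same chromosome as
the locus, and the TSS is taken as the gene start on the forward strand and the
gene end on the reverse strand.

```python
def tss_position(gene):
    """TSS of a (gene_id, start, end, strand) tuple: start on '+', end on '-'."""
    gene_id, start, end, strand = gene
    return start if strand == "+" else end

def closest_tss_gene(locus_position, genes):
    """Return the gene whose TSS is nearest to the locus, upstream or downstream."""
    return min(genes, key=lambda g: abs(locus_position - tss_position(g)))

# Hypothetical genes, all on one chromosome
genes = [
    ("geneA", 1000, 5000, "+"),    # TSS at 1000
    ("geneB", 6000, 9000, "-"),    # TSS at 9000
    ("geneC", 12000, 15000, "+"),  # TSS at 12000
]
print(closest_tss_gene(8000, genes)[0])  # geneB, whose TSS is 1000 bp away
```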

Nearest Downstream Genes/Nearest TSS window
============================================
In this window, you have to select the input parameters for the NDG/TSS
utilities. The parameters are:

*** Peak file
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the genomic coordinates that were found by a peak calling program
or obtained in some other way. The format should be tab or space delimited, where
each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
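
A peak file can be checked and cleaned before it is loaded. The sketch below is
not part of PeakAnalyzer; it assumes that a header line can be recognised by a
non-numeric "start" or "end" field, and the function name read_peaks is
hypothetical.

```python
def read_peaks(lines):
    """Parse peak lines into (chromosome, start, end) tuples.

    Fields may be separated by tabs or spaces. Lines whose second and third
    fields are not integers (for example a header line) are skipped.
    """
    peaks = []
    for line in lines:
        fields = line.split()  # splits on any run of tabs or spaces
        if len(fields) < 3:
            continue  # blank or incomplete line
        chrom, start, end = fields[:3]
        if not (start.isdigit() and end.isdigit()):
            continue  # most likely a header line
        peaks.append((chrom, int(start), int(end)))
    return peaks

example = [
    "chr\tstart\tend",    # header line that has to be removed
    "chr1\t1000\t1500",   # tab-delimited
    "chr2 200 900",       # space-delimited
]
print(read_peaks(example))
```

Writing the returned tuples back out, one tab-delimited line per peak, gives a
header-free file in the expected three-column layout.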

*** Annotation file
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the features/genes of interest and their location in the genome, in
one of two formats:
1. GTF format - can be downloaded from the Ensembl FTP site at:
                http://www.ensembl.org/info/data/ftp/index.html
                GTF FILES ARE EXPECTED TO CONTAIN THE SUFFIX ".gtf"
2. BED format - that can be downloaded from the UCSC table browser tool.
                The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

Requirements for BED file format - NDG utility:
The following fields (columns) should be present:
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount,
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) should be present:
chrom, chromStart, chromEnd, strand.

Please note that according to BED format, lower-numbered fields (columns) must
always be present if higher-numbered fields are used. Hence, although the field
"name" is not required for TSS, it should be specified in the file. (Just
inserting any character in column number 4 in the file is sufficient).

Sample annotation files for human and mouse are provided with the program, and are
located in the sub directory "Data" in the PeakAnalyzer folder.

*** Gene type options
When the annotation file is of GTF format, the user has the option to choose the
type of genes that will be used for annotation, either "Coding genes only" or
"Coding and non-coding genes". The latter includes genes such as miRNAs and other
non-coding RNAs.

*** Symbol file
This is an optional parameter for both the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols and can be downloaded, for
example, from the UCSC table browser. It is necessary when using a BED format
annotation file, since such files do not contain gene symbols, whereas for the
Ensembl GTF annotation files a symbol file is not required.

*** Output folder
This is a REQUIRED parameter.
Specify the directory in which Peak Annotation will write the output files.

*** Prefix
String to add to output file names, for example in case the same peak files are
analyzed using different parameters.
If you choose to run the program using the same input files several times,
and you don't use the prefix option, then the output files will be overwritten.

Overlap peak lists window
===========================
You will get to this window if you chose the "Overlap" (ODS) utility.

The input parameters are:

*** Peak file1
This is a REQUIRED parameter.
The file lists the genomic coordinates that were found by a peak calling program
or obtained in some other way. The format should be tab or space delimited, where
each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT

*** Peak file2
This is a REQUIRED parameter.
A second peak file to be compared with Peak file1.

*** Output folder
This is a REQUIRED parameter.
Specify the directory in which Peak Annotation will write the output files.

*** Prefix
String to add to output file names, for example in case the same peak files are
analyzed using different parameters.

*** Randomization
Check this box if you would like to calculate the significance of the
intersection and the fold enrichment over random.
If this option is checked, random regions matched to the first peak file will
be generated and intersected with the second.
In order to create random data sets, you have to provide a ChrLength file.
*** ChrLength file
File containing the size of each chromosome, for example:
chr1    197195432
chr2    181748087
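
The randomization step can be pictured with the sketch below. How PeakAnalyzer
actually samples its random regions is not documented here, so this is only one
plausible reading of "matched for chromosome and length": each peak is replaced
by a region of the same length placed uniformly at random on the same
chromosome. The function names are hypothetical.

```python
import random

def parse_chr_lengths(lines):
    """Read 'chromosome<whitespace>length' lines into a dictionary."""
    lengths = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths

def random_matched_regions(peaks, chr_lengths, rng):
    """For each peak, draw a region of equal length on the same chromosome."""
    regions = []
    for chrom, start, end in peaks:
        length = end - start
        max_start = chr_lengths[chrom] - length  # keep the region on the chromosome
        new_start = rng.randint(0, max_start)    # randint is inclusive on both ends
        regions.append((chrom, new_start, new_start + length))
    return regions

chr_lengths = parse_chr_lengths(["chr1    197195432", "chr2    181748087"])
peaks = [("chr1", 1000, 1500), ("chr2", 200, 900)]
rng = random.Random(42)  # fixed seed so that the sketch is reproducible
shuffled = random_matched_regions(peaks, chr_lengths, rng)
print(shuffled)
```

Repeating the draw many times and intersecting each random set with the second
peak file gives a background distribution against which the observed overlap can
be compared.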


Split Peaks window
====================
You will get to this window if you chose the "Split Peaks" option.

Input parameters:

*** Peak File
This is a REQUIRED parameter.
The file lists the genomic coordinates that were found by a peak calling program
or obtained in some other way. The format should be tab or space delimited, where
each locus is described by its "chromosome", "start" and "end" location.
THIS FILE SHOULD BE SORTED BY CHROMOSOME AND START POSITION
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
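
Sorting the peak list by chromosome and start position can be done as in the
sketch below. It assumes peaks are held as (chromosome, start, end) tuples and
sorts chromosome names as plain text, so "chr10" comes before "chr2"; whether
PeakAnalyzer expects a particular chromosome order is not stated here.

```python
def sort_peaks(peaks):
    """Sort (chromosome, start, end) tuples by chromosome name, then by start."""
    return sorted(peaks, key=lambda p: (p[0], p[1]))

unsorted_peaks = [("chr2", 500, 800), ("chr1", 3000, 3400), ("chr1", 100, 400)]
print(sort_peaks(unsorted_peaks))
```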

*** WIG type
This is a REQUIRED parameter.
This can be a WIG file OR a WIG folder that contains one WIG file for each
chromosome, where the WIG file describes the signals (usually number of reads)
along the genome. Split Peaks supports WIG files in variableStep or bedGraph formats.
The WIG header lines "track type" and "variableStep" (the latter when the file is in
variableStep format) are required.
The files can be zipped or gzipped, so it is not necessary to uncompress them. WIG
file names for each chromosome (under the WIG folder) should contain the word "chr" +
chromosome number, for example "my.chr12.wig".
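
The naming rule for a WIG folder can be illustrated with the sketch below. It is
not PeakAnalyzer's own lookup code: the function name wig_file_for is
hypothetical, and the rule that "chr1" must not match "chr12" is an assumption
about how an unambiguous match would be made.

```python
import re

def wig_file_for(chrom, file_names):
    """Return the first file name containing the chromosome name, or None."""
    # (?!\d) prevents "chr1" from matching inside "chr12"
    pattern = re.compile(re.escape(chrom) + r"(?!\d)")
    for name in file_names:
        if pattern.search(name):
            return name
    return None

names = ["my.chr1.wig", "my.chr12.wig.gz", "my.chrX.wig"]
print(wig_file_for("chr12", names))  # my.chr12.wig.gz
print(wig_file_for("chr3", names))   # None, no file for this chromosome
```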

*** Output Folder
This is a REQUIRED parameter.
An output directory must be specified, for Split Peaks to put the output files in.

*** Prefix
String to add to output file names, for example in case the same peak files are
analyzed using different parameters.

*** Fetch subpeak sequences
Check this box if you would like to fetch the subpeak sequences surrounding the
summit regions (where, for example, binding sites are located). You have to be
connected to the internet in order to fetch sequences.

Split and Fetch sequence parameters window
============================================

Split parameters:

*** Separation float
This value determines when a peak will be separated into subpeaks. Local maxima
regions are found within each peak and the heights of neighboring local maxima are
compared. The lower value is multiplied by the separation float to yield
the minimum depth required to separate the two peaks.
For example, a value of 0.5 means that the height of the valley should be less
than half the height of its summits in order for them to be separated.
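
The rule can be written down as a small test. This is a sketch that follows the
0.5 example above (the valley height is compared with the lower summit height
multiplied by the separation float); it is not taken from the PeakSplitter
source, and the function name should_split is hypothetical.

```python
def should_split(summit_a, summit_b, valley, separation_float):
    """Return True if two neighbouring local maxima should become separate subpeaks."""
    # The lower of the two summits sets the reference height
    threshold = min(summit_a, summit_b) * separation_float
    return valley < threshold

# Summits of 40 and 30 reads with separation float 0.5: threshold is 15 reads
print(should_split(40, 30, 12, 0.5))  # True, the valley (12) is below 15
print(should_split(40, 30, 20, 0.5))  # False, the valley (20) is not below 15
```

A larger separation float therefore splits peaks more readily, while a smaller
one requires a deeper valley between the summits.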

*** Minimum height
Height cutoff. Only subpeaks with at least this number of reads in their summit
region will be reported.

Fetch sequence parameters:

Please note that the sequences are fetched from the latest build of the genome.

*** Organism
The sequences are fetched directly from the Ensembl DAS database. The user has to
specify the organism, and PeakAnalyzer will fetch the corresponding sequences.

*** Length
Length of sequence to fetch (default 60).
The sequences are fetched around the summit, so if the length is 60, 30 bp
will be fetched upstream of the peak summit position and 30 bp downstream.
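
The window arithmetic is simple enough to sketch. The exact coordinate
convention used by PeakAnalyzer (inclusive or exclusive ends, handling of odd
lengths) is not documented here, so the function below, with the hypothetical
name summit_window, only illustrates the half-upstream, half-downstream idea.

```python
def summit_window(summit, length=60):
    """Return (start, end) of a window centred on the summit position."""
    half = length // 2  # 30 bp on each side for the default length of 60
    return summit - half, summit + half

print(summit_window(1000))      # (970, 1030)
print(summit_window(1000, 100)) # (950, 1050)
```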

*** Amount
Number of best subpeak sequences to fetch (those with the highest numbers of reads
in their summit region). These sequences can be used as input for motif prediction
tools such as MEME.
The default number is 300. This is the maximum number of sequences the web-based
version of MEME will accept (more sequences can be input when run locally).
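
Taken together, the "Minimum height" and "Amount" parameters amount to a filter
followed by a ranking. The sketch below assumes subpeaks are held as
(chromosome, start, end, summit_height) tuples; both this layout and the
function name best_subpeaks are hypothetical.

```python
def best_subpeaks(subpeaks, min_height, amount=300):
    """Keep subpeaks with at least min_height reads, then return the top `amount`."""
    kept = [sp for sp in subpeaks if sp[3] >= min_height]
    kept.sort(key=lambda sp: sp[3], reverse=True)  # highest summit first
    return kept[:amount]

subpeaks = [
    ("chr1", 100, 160, 25),
    ("chr1", 400, 470, 8),
    ("chr2", 50, 120, 40),
    ("chr2", 300, 380, 12),
]
print(best_subpeaks(subpeaks, min_height=10, amount=2))
```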


////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

Peak Annotation outputs:
------------------------

The output of the "NDG" utility consists of three tab-delimited files:
**************************************************************

A. "peakFileName.ndg.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.ndg.test".
This file describes the closest downstream genes for each genomic locus, and
contains the following fields:
        1. Chromosome
        2. Start
        3. End - These first three columns describe the location of the peak in
                the genome.
        4. # Overlapped_Genes - Number of transcripts that overlap with the
                genomic locus.
           More details about these genes are reported in the second output file
                described below.
        5. Downstream_FW_Gene - ID of the closest downstream gene on the forward
                strand.
        6. Symbol - Symbol of the closest downstream gene on the forward strand.
        7. Distance - Distance of the peak to its closest downstream gene on the
                forward strand.
        8. Downstream_REV_gene - ID of the closest downstream gene on the reverse
                strand.
        9. Symbol - Symbol of the closest downstream gene on the reverse strand.
        10. Distance - Distance of the peak to its closest downstream gene on the
                reverse strand.
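
The naming pattern used for the Peak Annotation output files (the utility tag is
inserted between the file name and its suffix) can be reproduced as in the
sketch below. It ignores the optional prefix, and the function name output_name
is hypothetical.

```python
import os

def output_name(peak_file, tag):
    """Insert a utility tag between the base name and the suffix of a peak file."""
    base = os.path.basename(peak_file)
    stem, suffix = os.path.splitext(base)  # "myPeaks", ".test"
    return f"{stem}.{tag}{suffix}"

for tag in ("ndg", "overlap", "summary", "tss"):
    print(output_name("myPeaks.test", tag))
```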

B. "peakFileName.overlap.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.overlap.test".
This file describes the transcripts overlapping the peaks, if any such are found.
        1. Chromosome
        2. Start
        3. End  - These first three columns describe the location of the peak in
                the genome.
        4. OverlapGene  - Overlapping gene ID
        5. Symbol       - Overlapping gene symbol
        6. Overlap_Begin - In which part of the gene does the peak's start position
                overlap
        7. Overlap_Center - In which part of the gene does the peak's central
                position overlap
        8. Overlap_End  - In which part of the gene does the peak's end position
                overlap

C. "peakFileName.summary.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.summary.test".
This file contains the following fields:
        1. Chromosome
        2. Start
        3. End  - These first three columns describe the location of the peak in
                the genome.
        4. OverlapGene  - Overlapping gene Symbol.
        5. Downstream Gene - Nearest downstream gene.
        6. Distance - Distance between the peak and its nearest downstream gene.

The output of the TSS option is a tab delimited file:
*****************************************************

"peakFileName.tss.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.tss.test"

This file contains the following fields:
        1. Chromosome
        2. Start
        3. End  - These first three columns describe the location of the peak in
                the genome.
        4. Distance     - The distance from the peak to its closest TSS.
        5. GeneStart    - The start location of the closest gene on the genome.
        6. GeneEnd      - The end location of the closest gene on the genome.
        7. ClosestTSS_ID - ID of the closest gene.
        8. Symbol       - Symbol of the closest gene.
        9. Strand       - Strand of closest gene.

The output of the "Overlap" option consists of three tab-delimited files:
******************************************************************

A. "peakFile1_peakFile2.overlap.txt"

For example, if the input peak files are "myPeaks1.txt" and "myPeaks2.txt", the
output file will be "myPeaks1_myPeaks2.overlap.txt"

Each line in this file describes an overlap event between two genomic loci, and
has the following fields:
        1. Chromosome
        2. peakFile1_Start      - Start location of the first genomic locus
        3. peakFile1_End        - End location of the first genomic locus
        4. peakFile1_Name       - Name of the first genomic locus (if it exists in
                the input file)
        5. peakFile2_Start      - Start location of the second genomic locus
        6. peakFile2_End        - End location of the second genomic locus
        7. peakFile2_Name       - Name of the second genomic locus (if it exists in
                the input file)

B+C. Unique files - one file for each genomic input file, which describes the unique peaks.
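
The relation between the overlap file and the two unique files can be sketched
as follows. The sketch assumes peaks are (chromosome, start, end) tuples with
inclusive coordinates and that a single shared position counts as an overlap;
PeakAnalyzer's own criterion may differ. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) regions share at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def compare_peak_lists(peaks1, peaks2):
    """Return (overlap events, peaks unique to list 1, peaks unique to list 2)."""
    common = [(p, q) for p in peaks1 for q in peaks2 if overlaps(p, q)]
    hit1 = {p for p, _ in common}
    hit2 = {q for _, q in common}
    unique1 = [p for p in peaks1 if p not in hit1]
    unique2 = [q for q in peaks2 if q not in hit2]
    return common, unique1, unique2

peaks1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks2 = [("chr1", 150, 250), ("chr2", 300, 400)]
common, unique1, unique2 = compare_peak_lists(peaks1, peaks2)
print(common)   # one overlap event on chr1
print(unique1)  # peaks found only in the first list
print(unique2)  # peaks found only in the second list
```

The all-against-all comparison is quadratic in the number of peaks, which is
acceptable for a sketch but would be replaced by a sorted sweep for large files.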


PeakSplitter output
-------------------
If you specified text for the "prefix" parameter, all output file names will
start with that text.

1. peakFileName.subpeaks.inputFileNameSuffix
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.subpeaks.test"

This is a tabular file containing information about the subpeaks, including
chromosome name, start position of the subpeak, end position of the subpeak,
number of reads at the peak summit position, and the subpeak summit position
relative to the start position of the subpeak region.

2. peakFileName(without suffix).bestSubpeaks.fa
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.bestSubpeaks.fa".

This is a fasta file, containing the sequences of the best subpeaks (those with
highest number of reads in their summit position).
The fasta file can be uploaded to a motif prediction program such as MEME.

Plots
*****
There is an option to generate summary plots of the data.
To generate them, click the "Generate plot" button
that appears after the program has been executed.

Nearest downstream genes (NDG) plots:
1. Peaks overlapping genes
The position of peaks within genes is plotted, based
on the location of the central point of the peak region.
Sometimes the central point falls outside a known gene (although the peak itself
overlaps the gene); in this case, the overlapping region is defined as "Intergenic".
2. Distance to NDG
The distance is calculated between the central point of the peak and the TSS
of the nearest downstream gene.
Distance is always a positive value.

Transcription start site (TSS) plot:
1. Distance from TSS
The distance is calculated between the central point of the peak and the TSS
of the nearest gene.
Since the distance is calculated to the nearest TSS rather than the nearest
downstream TSS, the values can be both positive and negative.
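
A signed distance of this kind can be computed as in the sketch below. The sign
convention is an assumption and is not documented here: negative values mean the
peak centre lies upstream of the TSS with respect to the gene's strand, positive
values mean it lies downstream. Genes are hypothetical (start, end, strand)
tuples and the function names are hypothetical as well.

```python
def peak_center(start, end):
    """Central point of a peak region (integer division)."""
    return (start + end) // 2

def signed_distance_to_tss(peak, gene):
    """Distance from the peak centre to the TSS, signed relative to the gene strand."""
    gene_start, gene_end, strand = gene
    tss = gene_start if strand == "+" else gene_end  # TSS depends on the strand
    offset = peak_center(peak[0], peak[1]) - tss
    # ASSUMPTION: negative = upstream of the TSS, positive = downstream
    return offset if strand == "+" else -offset

print(signed_distance_to_tss((900, 1100), (1500, 3000, "+")))  # -500
print(signed_distance_to_tss((900, 1100), (200, 800, "-")))    # -200
print(signed_distance_to_tss((900, 1100), (500, 3000, "+")))   # 500
```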


Tutorial


