Introduction to ChIP-Seq analysis

From BITS wiki
Jump to: navigation, search

Mapping reads with Bowtie

Exercise created by Morgane Thomas Chollier

Obtaining the reference genome

In the ChIP-Seq experiment of E. coli we want to see which genomic regions are bound to transcription factor FNR. However, at this point what we have is a set of reads that are identified by their location on the flow cell. To answer our question we should link the reads to regions in the genome to obtain their genomic coordinates. This process is called mapping.

For Illumina reads the standard mappers are BWA and Bowtie (version 1 and 2).

The Bowtie_1 aligner is installed on GenePattern. Check the documentation on GenePattern or read the manual on the Bowtie website.

Bowtie needs the complete genome, in FASTA format as a reference sequence to align the reads to.

The genome sequence of E. coli K-12 MG1655 needs to be in a specific format (=index) for bowtie. Several pre-built indexes are available to download on the bowtie webpages or the iGenomes website.

Although the E. coli sequence is available we will not use it to show you how you should proceed if you don't find your reference sequence on this website. In that case you will need to make the index yourself.

If you can't find your reference on the iGenomes website you have to download it from:

Since Ensembl focuses on higher eukaryotes, we are going to download the genome from NCBI.

NCBI Nucleotide is notorious for the amount of errors it contains both in sequences and in annotations. Therefore, if available you should always use sequences from RefSeq, the clean subset of NCBI’s Nucleotide database. This sequence is not a RefSeq sequence. You can see that because the accession number does not contain an underscore and all RefSeq accession numbers contain an underscore.

This brings us to a RefSeq record with accession number NC_000913.

This creates a file called sequence.fasta in the Downloads folder of your computer.

If the upload takes too long use the fasta file from the SHARED_DATA folder in GenePattern.


Indexing the reference genome

You cannot do the mapping directly on the .fasta file, you need to index the file first. Reference genomes from the Bowtie/iGenomes website are already indexed so when you get your reference there you can skip this step. Reference genomes downloaded from NCBI, Ensembl or UCSC need to be indexed using the Bowtie_1 indexer tool.

Indexing a reference genome is a one-time effort: you do not have to repeat it each time you do a mapping.

Check the documentation of the Bowtie_1 indexer to see the parameters it takes. The documentation shows that you need to specify:

  • the reference genome that you want to index as an input (in our case the E. coli fasta file)
  • the name of the indexed output file

Give the output file the same name as the input file: Escherichia_coli_K12. The Bowtie indexer will generate a zip file containing a whole set of .ebwt files whose name all start with Escherichia_coli_K12.

Copy the zip-file to your Uploads folder.

Mapping the reads

Open the Bowtie_1 aligner parameter form.

You need to tell bowtie what type of file your input file is.

FastQ is the default, so you don't have to explicitly set this option.

Bowtie has two modes of mapping. The simplest strategy is called v-mode alignment: you align complete reads (from the first to the last base aka end-to-end) to the reference and you count the number of mismatches in this alignment. In this mode quality values are ignored and you need to tell bowtie the maximum number of mismatches you allow.

Remember because the base quality at the 3'end of the reads is lower, base calls at the 3'ends are often incorrect. This will inevitably lead to mismatches in the alignments. Reads with more than 2 mismatches will not be reported. To avoid losing too many reads during the mapping we can either trim low quality bases from the 3' ends of the reads before the alignment is done or use a mapping strategy that takes into account the quality scores of the bases.

This strategy is called n-mode alignment. It's the default mode. It aligns seeds, the first N bases of the reads at the high quality 5'end, to the reference. You have to set the length of the reads and the maximum number of mismatches allowed in the seed alignment. Additionally the sum of the quality scores at all mismatched positions (not just in the seed) is calculated and you can set a maximum for this parameter. In this way, reads with mismatches with high quality scores will not be reported whereas mismatches with low scores are more or less ignored.

The FASTQC report showed that the last base is of low quality. Since the reads are 36 bases ling we could use seeds of 35 bases for the mapping.

We also need to specify that we only want to report reads that map specifically to one location in the reference.

By default, bowtie will include unmapped reads in the output file. That's unnecessary since no one uses these unmapped reads.

We want to get a rough idea of the quality of the mapping. Look at the stdout.txt file that was generated by bowtie to get the basic statistics of the mapping.


GPBowtie5.png

You see that 62% of the reads were mapped. This may seem low but remember that we haven't done any cleaning on the file. According to FASTQC the file contains about 30% of adapter sequences that will not map.

The output of Bowtie is a sam file. The SAM format corresponds to large text files, that can be compressed ("zipped") into .bam files that take up to 4 times less disk space and are usually sorted and indexed for fast access to the data they contain. The index of a .bam file is named .bai and some tools require these index files to process the .bam files. So we need to transform the .sam file with our mapping results to a sorted .bam and .bai file. You can use one of the tools from the Picard toolbox for this.


Repeat the analysis for the control sample SRR576938.fastq These two fastq files come from a ChIPSeq experiment, the first contains the reads of the ChIP sample, the second of the control sample, which consists of fragmented genomic DNA. You need both to identify regions in the genome that are represented more in the ChIP reads than in the control (these are the regions that bind to the transcription factor).

Suppose that you have many fastq files that you need to map to the E. coli genome. The best way to ensure that you can reuse tools and parameter settings during the next analysis, is to combine them into a pipeline.

Now you use the pipeline as a regular module.

At this point, you have two sam and two bam files, one for the treated sample, one for the control sample.