NGS RNASeq DE Exercise.1a

From BITS wiki
Jump to: navigation, search

Download the bowtie2 reference genome index for hg19 and related files

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]


Download pre-build indexes

The tasks described here were done for you on the BITS laptops. If you need to create your own working environment, you will need to follow the code provided here and create the suited folders on your own computer.

Handicon.png The index was already produced on the BITS training laptops and this part does not need to be done during the session

We will map the reads against the most popular reference genome today, being the hg19 build produced by the NCBI and related to many UCSC tables. Mapping requires a specially formatted file set (hash database), derived from the chosen reference genome, and that needs to be built using the bowtie2 software or obtained 'ready-to-use' as done here [1]. We will also need the fasta copy of the same hg19 genome; one easy way to obtain this fasta is to generate it from the hash table using bowtie2. This ensures that both entities share the same sequence, which is not necessarily the case when downloading data from different web servers.

Technical.png Downloading the h19 fasta file and building the index on the host computer from it is also possible but takes significantly more computing time

#!/bin/bash
# get_reference.sh
 
# NOTE: tophat links to another resource from Illumina with more reference data
# http://tophat.cbcb.umd.edu/igenomes.shtml
# we prefer here to get the reference from the Bowtie site
 
## Download the pre-built 'hg19' indexes from the bowtie2 website (3.4GB!)
# http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
 
# adapt if your path is different, both places should be writable to allow software & data installation 
export BIODATA=/opt/biodata
export BOWTIE2_INDEXES=$BIODATA/bowtie_indexes
 
# we first get the bowtie2 hg19 index as follows
wget -P $BOWTIE2_INDEXES/ --timestamping \
	'ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19.zip'
 
# unzip the files to $BOWTIE2_INDEXES
unzip -d $BOWTIE2_INDEXES $BOWTIE2_INDEXES/hg19.zip
 
## Recreate the fasta version of the genome from the index using bowtie2-inspect
# the sequence will exactly correspond to that used to create the index
# this is a good thing because different sources have different genome flavors
bowtie2-inspect $BOWTIE2_INDEXES/hg19 \
	> $BOWTIE2_INDEXES/hg19.fa
 
## below we create two additional index files that will be required for various tools including Picard tools and samtools
 
## index the genome fasta file for rapid access
samtools faidx $BOWTIE2_INDEXES/hg19.fa
 
## also create a dictionary file that some 3rd party program may need
java -jar $PICARD/picard.jar CreateSequenceDictionary \
	R=$BOWTIE2_INDEXES/hg19.fa \
	O=$BOWTIE2_INDEXES/hg19.dict

Inspecting the top of the obtained dictionary file reveals its true nature. This file contains the @SQ lines that will be reproduced in the future SAM/BAM files after mapping of the reads to this reference genome and describes name and length for each reference chromosome.

head -5 hg19.dict 
@HD     VN:1.4  SO:unsorted
@SQ     SN:chr1 LN:249250621    UR:file:/opt/biodata/bowtie_indexes/hg19.fa     M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:chr2 LN:243199373    UR:file:/opt/biodata/bowtie_indexes/hg19.fa     M5:a0d9851da00400dec1098a9255ac712e
@SQ     SN:chr3 LN:198022430    UR:file:/opt/biodata/bowtie_indexes/hg19.fa     M5:641e4338fa8d52a5b781bd2a2c08d3c3
@SQ     SN:chr4 LN:191154276    UR:file:/opt/biodata/bowtie_indexes/hg19.fa     M5:23dccd106897542ad87d2765d28a19a1

 

create the transcript index for efficient mapping of junction-spanning reads

Most people rely for human data on either RefGene or Ensembl genes for this process. We present example code for each reference below, please adapt to obtain the other reference standard.

When mapping RNASeq data to the genome reference, a transcriptome index would be created by tophat based on the provided gtf file at runtime but this would be done for each new run, thereby loosing time.

We can instead prepare the required transcriptome indexes in advance using a 'dry' tophat run (without reads).

Technical.png we show here the command to create the ensGene index

# prepare transcriptome indexes for Ensgenes
tophat -G $BOWTIE2_INDEXES/hg19_ensGene.gtf \
    --transcriptome-index=$BOWTIE2_INDEXES \
    $BOWTIE2_INDEXES/hg19

Alternatively, we can assemble a transcript indexes using bowtie2-build

Technical.png we show here the command to create the refGene index

# first create a fasta file from the gtf file and the full genome fasta
gtf_to_fasta $BOWTIE2_INDEXES/hg19_refGene.gtf \
    $BOWTIE2_INDEXES/hg19.fa \
    $BOWTIE2_INDEXES/hg19_refGene.fa
 
# we then create the index using bowtie2-build 
bowtie2-build $BOWTIE2_INDEXES/hg19_refGene.fa \
    $BOWTIE2_INDEXES/hg19_refGene

In some exercises, we will use only mappings to chr22 and will need a corresponding transcriptome annotation file. Such file can easily be created using grep and a full genome gtf file as shown below.

# keep only lines starting by 'chr22'
grep "^chr22" hg19_ensGene.gtf > chr22-hg19_ensGene.gtf 
grep "^chr22" hg19_refGene.gtf > chr22-hg19_refGene.gtf

Technical.png When you would need to filter on another column than the leftmost column, awk would do the job as nicely.

 

Conclusion

You should now have a folder containing bowtie2 index files as well as the full genome in fasta format and files providing information about the transcript models and other accessory files. This reference data is ready for use with various software installed on the BITS laptop as detailed in other pages of this session.

Technical.png We do not provide these files on our data server due to size constrains


References:
  1. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]