NGS RNASeq DE Exercise.1a
Download the bowtie2 reference genome index for hg19 and related files
Download pre-build indexes
The tasks described here were done for you on the BITS laptops. If you need to create your own working environment, you will need to follow the code provided here and create the suited folders on your own computer.
We will map the reads against the most popular reference genome today, being the hg19 build produced by the NCBI and related to many UCSC tables. Mapping requires a specially formatted file set (hash database), derived from the chosen reference genome, and that needs to be built using the bowtie2 software or obtained 'ready-to-use' as done here . We will also need the fasta copy of the same hg19 genome; one easy way to obtain this fasta is to generate it from the hash table using bowtie2. This ensures that both entities share the same sequence, which is not necessarily the case when downloading data from different web servers.
#!/bin/bash # get_reference.sh # NOTE: tophat links to another resource from Illumina with more reference data # http://tophat.cbcb.umd.edu/igenomes.shtml # we prefer here to get the reference from the Bowtie site ## Download the pre-built 'hg19' indexes from the bowtie2 website (3.4GB!) # http://bowtie-bio.sourceforge.net/bowtie2/index.shtml # adapt if your path is different, both places should be writable to allow software & data installation export BIODATA=/opt/biodata export BOWTIE2_INDEXES=$BIODATA/bowtie_indexes # we first get the bowtie2 hg19 index as follows wget -P $BOWTIE2_INDEXES/ --timestamping \ 'ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19.zip' # unzip the files to $BOWTIE2_INDEXES unzip -d $BOWTIE2_INDEXES $BOWTIE2_INDEXES/hg19.zip ## Recreate the fasta version of the genome from the index using bowtie2-inspect # the sequence will exactly correspond to that used to create the index # this is a good thing because different sources have different genome flavors bowtie2-inspect $BOWTIE2_INDEXES/hg19 \ > $BOWTIE2_INDEXES/hg19.fa ## below we create two additional index files that will be required for various tools including Picard tools and samtools ## index the genome fasta file for rapid access samtools faidx $BOWTIE2_INDEXES/hg19.fa ## also create a dictionary file that some 3rd party program may need java -jar $PICARD/picard.jar CreateSequenceDictionary \ R=$BOWTIE2_INDEXES/hg19.fa \ O=$BOWTIE2_INDEXES/hg19.dict
Inspecting the top of the obtained dictionary file reveals its true nature. This file contains the @SQ lines that will be reproduced in the future SAM/BAM files after mapping of the reads to this reference genome and describes name and length for each reference chromosome.
head -5 hg19.dict @HD VN:1.4 SO:unsorted @SQ SN:chr1 LN:249250621 UR:file:/opt/biodata/bowtie_indexes/hg19.fa M5:1b22b98cdeb4a9304cb5d48026a85128 @SQ SN:chr2 LN:243199373 UR:file:/opt/biodata/bowtie_indexes/hg19.fa M5:a0d9851da00400dec1098a9255ac712e @SQ SN:chr3 LN:198022430 UR:file:/opt/biodata/bowtie_indexes/hg19.fa M5:641e4338fa8d52a5b781bd2a2c08d3c3 @SQ SN:chr4 LN:191154276 UR:file:/opt/biodata/bowtie_indexes/hg19.fa M5:23dccd106897542ad87d2765d28a19a1
create the transcript index for efficient mapping of junction-spanning reads
Most people rely for human data on either RefGene or Ensembl genes for this process. We present example code for each reference below, please adapt to obtain the other reference standard.
When mapping RNASeq data to the genome reference, a transcriptome index would be created by tophat based on the provided gtf file at runtime but this would be done for each new run, thereby loosing time.
We can instead prepare the required transcriptome indexes in advance using a 'dry' tophat run (without reads).
# prepare transcriptome indexes for Ensgenes tophat -G $BOWTIE2_INDEXES/hg19_ensGene.gtf \ --transcriptome-index=$BOWTIE2_INDEXES \ $BOWTIE2_INDEXES/hg19
Alternatively, we can assemble a transcript indexes using bowtie2-build
# first create a fasta file from the gtf file and the full genome fasta gtf_to_fasta $BOWTIE2_INDEXES/hg19_refGene.gtf \ $BOWTIE2_INDEXES/hg19.fa \ $BOWTIE2_INDEXES/hg19_refGene.fa # we then create the index using bowtie2-build bowtie2-build $BOWTIE2_INDEXES/hg19_refGene.fa \ $BOWTIE2_INDEXES/hg19_refGene
In some exercises, we will use only mappings to chr22 and will need a corresponding transcriptome annotation file. Such file can easily be created using grep and a full genome gtf file as shown below.
# keep only lines starting by 'chr22' grep "^chr22" hg19_ensGene.gtf > chr22-hg19_ensGene.gtf grep "^chr22" hg19_refGene.gtf > chr22-hg19_refGene.gtf
You should now have a folder containing bowtie2 index files as well as the full genome in fasta format and files providing information about the transcript models and other accessory files. This reference data is ready for use with various software installed on the BITS laptop as detailed in other pages of this session.