NGS variant analysis: laptop configuration and files

From BITS wiki
Jump to: navigation, search


[ Main_Page | NGS_data_analysis | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.1 ]


BITS laptop configuration

In order to ease the typing during this workshop, all laptops were cloned from a prototype installation on which a number of variables were defined. When reproducing a command or typing it, please use the defined variables to avoid typos. Also try autocompletion to speed up repetitive commands and recycle code from your history.

People present today should have followed a previous BITS training introducing a number of unix commands and tools that will be extensively used today.

default environment variables

When starting a terminal session, a default script will be sourced (run) that defines a number of aliases and default variables. This script defines a number of shortcuts, for all BITS laptops, that will ease your typing. During this training we will almost always work while in the BASE folder defined below so we can omit it in our commands

Cli tools.png Bash environment variable are created without leading '$' and used in a command with a leading '$'

# your 'sweet Home', you will often be prompted to check you are there

  • HOME=/home/bits # the standard home folder associated to any unix user (today, you all have the user name 'bits')
  • BASE=${HOME}/NGS_DNASeq-training (this folder contains most of what was created for this workshop except what was installed in the system hierarchical tree, it will also host all files that you will generate today)

# the place where special packages were installed for this training

  • BIOTOOLS # the custom program location searched by the system to find all executable NGS / BioWare programs

# the folder where all reference files from public repositories were downloaded or built for BITS workshops including this one. Note that an additional ref folder was created within the $BASE folder to house additional reference data specific to this training.

  • BIODATA=/opt/.../biodata

# a shortcut to refer to picard and snpEff java commands, note that bin could be substituted in that configuration

  • PICARD # location of the 'picard' java suite (see information on how to launch a java program here)
  • SNPEFF # location of the 'snpEff/snpSift' java suite
  • TRIM # location of the 'trimmomatic' java suite
  • VARSCAN # location of the 'varscan2' java suite

# a folder for bash scripts and mini-workflows used today

  • SCRIPTS=${BASE}/scripts

basic directory structure at the beginning of the workshop

When starting terminal, you will always get a list of folders present in the Main project folder in order to help you find the data.

terminal startup window

startup-Terminal.png

When starting this session, you will find a number of folders in $BASE that contain the raw data necessary to proceed.

The basic project folder structure is as follows:

(Unix System hierarchical tree)

|--HOME--|
      ├─NGS_DNASeq-training ($BASE)
             ├── Final_Results (where all pre-computer results were moved)
             ├── bundle_2.5 (Broad reference files required for GATK and more...)
             ├── public_info (gold-standard data from the internet)
             ├── ref (reference files for biocomputing applications)
             └── scripts  (scripts to perform parts of the workflow)

( more folders will be added in this level during the workshop )

The special folder called Final_Results contains all result folders moved after completing the full workshop and where you can copy intermediate files if you do not succeed in creating them yourself. Please try not to abuse of this resource as you will mainly learn by running your own command rather than copying the solution.

Cli tools.png All commands are run from the $BASE folder in order to have direct access to all folders prepared by the trainer.
When you have a doubt, look at your prompt and/or type 'cd $BASE' to return to the working folder

launching a java command

When using tools programmed in Java language, calling the tool itself is not enough, instead you need to tell java to start it and provide a minimum of information like the maximum RAM space allocatable to the tool. For a maximum 4GB space, this is done by typing

Handicon.png java -Xmx4g -jar the-program.jar <here all subsequent program arguments (separated by 'space')>

For your convenience, the path to picard tools has been stored into a variable $PICARD and a second one into a vzariable SNPEFF. In order to use these variable in the terminal you may need to do a little gym as detailed in our Q&A page.

More keyboard shortcuts can be found in our cheat sheet Bash_Keyboard_Shortcuts.pdf.

download and organise reference material

Although this was done on your BITS laptop, you may wish to reproduce the training environment later on on another computer. The following information will guide you through the process of obtaining reference data and preparing it properly. Installation of the software part is not detailed here and will require a minimum of IT experience to be done correctly. Please refer to each tool website to obtain source and installation information and make sure that all installed executable tools are found by your environment (in the $PATH).

download the hg19 reference sequence

Please do not run this code during the training, it has been done already and takes quite some time.

# create a 'ref' folder in current directory (if absent)
mkdir -p ref
 
# download the quite large (~6.5Gb) UCSC_hg19 build archive from Illumina in 'ref/'
buildarchive="HiSeqAnalysisSoftware_UCSC_hg19.tar.gz"
wget -P ref/ ftp://webdata:webdata@ussd-ftp.illumina.com/Data/ReferenceGenomes/${buildarchive}
 
# extract the hg19 reference fasta file (genome.fa) 
#  from an internal archive (packed_reference.tar.gz) 
#   within the main archive ${buildarchive}
#    and rename it (HiSeq_UCSC_hg19.fa) in only one command !!
tar -x Sequence/IsaacIndex/packed_reference.tar.gz \
	-zf ref/HiSeqAnalysisSoftware_UCSC_hg19.tar.gz -O |\
	tar -x genome.fa -zf - -O > ref/HiSeq_UCSC_hg19.fa
# alternatively, decompress the whole archive to its large original folder structure then fetch the file from the subfolder.

inspect the reference genome fasta file

# inspect the resulting file
head -5 ref/HiSeq_UCSC_hg19.fa
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG
GTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC
CTGCCTCATTCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTAAAGTGTGTTA
ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATC

obtain the full list of chromosomes present in a fasta file

# the string **'^>'** looks for fasta header lines starting with **'>'**
grep "^>" ref/HiSeq_UCSC_hg19.fa
>chrM
>chr1
>chr2
# ... (cut out)
>chr21
>chr22
>chrX
>chrY

download from the UCSC the size of all reference genome chromosomes

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
 "select chrom, size from hg19.chromInfo" > ref/full_hg19.genome
 
# inspect the obtained file with **cat**, **head**, **tail**, or **less**
head -5 ref/full_hg19.genome
chrom	size
chr1	249250621
chr2	243199373
chr3	198022430
chr4	191154276

subset the reference genome to keep only the full chromosomes

Non-classical chromosome contain a '_' symbol in their name. We can get rid of them at once with grep -v (grep exclusion mode)

grep -v "_" ref/full_hg19.genome > ref/hg19.genome

create a BWA index for the selected reference genome

Handicon.png no need to run this part, it was done for you and takes quite some time to complete

# first move to the ref folder
whereami=$(pwd)
cd ref
 
# then create the BWA index for our reference genome hg19
bwa index HiSeq_UCSC_hg19.fa
# [bwa_index] Pack FASTA... 31.63 sec
# [bwa_index] Construct BWT for the packed sequence...
# [BWTIncCreate] textLength=6191387966, availableWord=447648912
# [BWTIncConstructFromPacked] 10 iterations done. 99999998 characters processed.
# [BWTIncConstructFromPacked] 20 iterations done. 199999998 characters processed.
# ... some time and many lines later
# [BWTIncConstructFromPacked] 680 iterations done. 6177547390 characters processed.
# [bwt_gen] Finished constructing BWT in 687 iterations.
# [bwa_index] 3661.20 seconds elapse.
# [bwa_index] Update BWT... 38.29 sec
# [bwa_index] Pack forward-only FASTA... 20.68 sec
# [bwa_index] Construct SA from BWT and Occ... 1709.39 sec
# [main] Version: 0.7.4-r385
# [main] CMD: bwa index HiSeq_UCSC_hg19.fa
# [main] Real time: 6493.078 sec; CPU: 5459.750 sec
 
# also create a dictionary file for Picard tools
java -Xmx4g -jar $PICARD/CreateSequenceDictionary.jar \
	R=HiSeq_UCSC_hg19.fa \
	O=HiSeq_UCSC_hg19.dict
 
# the content of the reference folder is now
ls -lah
# total 8.0G
# drwxr-xr-x 12 bits bits  408 Aug 27 16:35 .
# drwxr-xr-x 15 bits bits  510 Aug 27 15:44 ..
# -rw-r--r--  1 bits bits 4.7K Aug 27 14:47 HiSeq_UCSC_hg19.dict
# -rw-r--r--  1 bits bits 3.0G Aug 27 14:42 HiSeq_UCSC_hg19.fa
# -rw-r--r--  1 bits bits 6.6K Aug 27 16:02 HiSeq_UCSC_hg19.fa.amb
# -rw-r--r--  1 bits bits  939 Aug 27 16:02 HiSeq_UCSC_hg19.fa.ann
# -rw-r--r--  1 bits bits 2.9G Aug 27 16:00 HiSeq_UCSC_hg19.fa.bwt
# -rw-r--r--  1 bits bits  783 Aug 27 14:44 HiSeq_UCSC_hg19.fa.fai
# -rw-r--r--  1 bits bits 739M Aug 27 16:03 HiSeq_UCSC_hg19.fa.pac
# -rw-r--r--  1 bits bits 1.5G Aug 27 16:37 HiSeq_UCSC_hg19.fa.sa
# -rw-r--r--  1 bits bits 2.0K Aug 27 14:43 full_hg19.genome
# -rw-r--r--  1 bits bits  387 Aug 27 15:07 hg19.genome
 
# move back to where you were
# tip: this can also be done with 'pushd ref' to start and 'popd' at the end (see man pages)
cd ${whereami}

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats | NGS Exercise.1 ]