NGS Exercise.3e

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d ]


Align paired end reads to the human reference genome hg19 using RTG 3.6.1
# added in 2016


rtg_logo.png

Why use the RTG software instead of the better known BWA, Bowtie2, or STAR?

RTG (RealTime Genomics) is a less well known aligner suite that was introduced in the end of the years 2010 and has since then pretty much evolved into a very complete product. RTG can be used free of charge for academia and we strongly support its use (please read more and download it from http://www.realtimegenomics.com/[1])

RTG v3.6.1 commands

Usage: rtg COMMAND [OPTION]...
       rtg RTG_MEM=16G COMMAND [OPTION]...  (e.g. to set maximum memory use to 16 GB)
 
Type 'rtg help COMMAND' for help on a specific command.
The following commands are available:
 
Data formatting:
	format        	convert sequence data files to SDF
	cg2sdf        	convert Complete Genomics reads to SDF
	sdf2fasta     	convert SDF to FASTA
	sdf2fastq     	convert SDF to FASTQ
	sdf2sam       	convert SDF to SAM/BAM
	sdf2cg        	convert SDF formatted CG data to CG TSV
 
Read mapping:
	map           	read mapping
	mapf          	read mapping for filtering purposes
	cgmap         	read mapping for Complete Genomics data
 
Protein search:
	mapx          	translated protein search
 
Assembly:
	assemble      	assemble reads into long sequences
	addpacbio     	add Pacific Biosciences reads to an assembly
 
Variant detection:
	calibrate     	create calibration data from SAM/BAM files
	svprep        	prepare SAM/BAM files for structural variant analysis
	sv            	find structural variants
	discord       	detect structural variant breakends using discordant reads
	coverage      	calculate depth of coverage from SAM/BAM files
	snp           	call variants from SAM/BAM files
	family        	call variants for a family following Mendelian inheritance
	somatic       	call variants for a tumor/normal pair
	population    	call variants for multiple potentially-related individuals
	lineage       	call de novo variants in a cell lineage
	avrbuild      	build an AVR model from training examples
	avrpredict    	run AVR on a VCF file
	cnv           	call CNVs from paired SAM/BAM files
 
Metagenomics:
	species       	estimate species frequency in metagenomic samples
	similarity    	calculate similarity matrix and nearest neighbor tree
 
Pipelines:
	composition-meta-pipeline
	              	run metagenomic composition pipeline
	functional-meta-pipeline
	              	run metagenomic functional pipeline
	composition-functional-meta-pipeline
	              	run metagenomic composition and functional pipelines
 
Simulation:
	genomesim     	generate simulated genome sequence
	cgsim         	generate simulated reads from a sequence
	readsim       	generate simulated reads from a sequence
	readsimeval   	evaluate accuracy of mapping simulated reads
	popsim        	generate a VCF containing simulated population variants
	samplesim     	generate a VCF containing a genotype simulated from a population
	childsim      	generate a VCF containing a genotype simulated as a child of two parents
	denovosim     	generate a VCF containing a derived genotype containing de novo variants
	samplereplay  	generate the genome corresponding to a sample genotype
 
Utility:
	bgzip         	compress a file using block gzip
	index         	create a tabix index
	extract       	extract data from a tabix indexed file
	aview         	ASCII read mapping and variant viewer
	sdfstats      	print statistics about an SDF
	sdfsplit      	split an SDF into multiple parts
	sdfsubset     	extract a subset of an SDF into a new SDF
	sdfsubseq     	extract a subsequence from an SDF as text
	sam2bam       	convert SAM file to BAM file and create index
	sammerge      	merge sorted SAM/BAM files
	samstats      	print statistics about a SAM/BAM file
	samrename     	rename read id to read name in SAM/BAM files
	mapxrename    	rename read id to read name in mapx output files
	chrstats      	check expected chromosome coverage levels from mapping calibration files
	mendelian     	check a multisample VCF for Mendelian consistency
	vcfstats      	print statistics about variants contained within a VCF file
	vcfmerge      	merge single-sample VCF files into a single multi-sample VCF
	vcffilter     	filter records within a VCF file
	vcfannotate   	annotate variants within a VCF file
	vcfsubset     	create a VCF file containing a subset of the original columns
	vcfeval       	evaluate called variants for agreement with a baseline variant set
	pedfilter     	filter and convert a pedigree file
	pedstats      	print information about a pedigree file
	avrstats      	print statistics about an AVR model
	rocplot       	plot ROC curves from vcfeval ROC data files
	ncbi2tax      	create RTG taxonomy from NCBI taxdump files
	taxfilter     	create and manipulate RTG taxonomy files
	taxstats      	summarize and verify the taxonomy in an SDF
	usageserver   	run a local server for collecting RTG command usage information
	version       	print version and license information
	license       	print license information for all commands
	help          	print this screen or help for specified command

Build the RTG index files from the HiSeq_UCSC_hg19.fa reference

The command creates hash files from the fasta sequence of the reference genome. As always, you are warmly advised to read the documentation. A new ENV variable was created in .bashrc and exported that points to the default location for rtg indexes. This variable was named RTG_INDEXES and is used below.

rtg format command arguments

Usage: rtg format [OPTION]... -o SDF FILE+
                  [OPTION]... -o SDF -I FILE
                  [OPTION]... -o SDF -l FILE -r FILE
 
Converts the contents of sequence data files (FASTA/FASTQ/SAM/BAM) into the RTG Sequence Data File (SDF)
format.
 
File Input/Output
  -f, --format=FORMAT            format of input (Must be one of [fasta, fastq, sam-se, sam-pe]) (Default
                                 is fasta)
  -I, --input-list-file=FILE     file containing a list of input read files (1 per line)
  -l, --left=FILE                left input file for FASTA/FASTQ paired end data
  -o, --output=SDF               name of output SDF
  -p, --protein                  input is protein. If this option is not specified, then the input is
                                 assumed to consist of nucleotides
  -q, --quality-format=FORMAT    format of quality data for fastq files (use sanger for Illumina 1.8+)
                                 (Must be one of [sanger, solexa, illumina])
  -r, --right=FILE               right input file for FASTA/FASTQ paired end data
      FILE+                      input sequence files. May be specified 0 or more times
 
Filtering
      --duster                   treat lower case residues as unknowns
      --exclude=STRING           exclude input sequences based on their name. If the input sequence
                                 contains the specified string then that sequence is excluded from the SDF.
                                 May be specified 0 or more times
      --select-read-group=STRING when formatting from SAM/BAM input, only include reads with this read
                                 group ID
      --trim-threshold=INT       trim read ends to maximise base quality above the given threshold
 
Utility
      --allow-duplicate-names    disable checking for duplicate sequence names
  -h, --help                     print help on command-line flag usage
      --no-names                 do not include name data in the SDF output
      --no-quality               do not include quality data in the SDF output
      --sam-rg=STRING|FILE       file containing a single valid read group SAM header line or a string in
                                 the form "@RG\tID:READGROUP1\tSM:BACT_SAMPLE\tPL:ILLUMINA"
rtg format -o HiSeq_UCSC_hg19 HiSeq_UCSC_hg19.fa

index building output

Formatting FASTA data
Processing "HiSeq_UCSC_hg19.fa"
 
Input Data
Files              : HiSeq_UCSC_hg19.fa
Format             : FASTA
Type               : DNA
Number of sequences: 25
Total residues     : 3095693983
Minimum length     : 16571
Maximum length     : 249250621
 
Output Data
SDF-ID             : 53b72eea-afb1-4f1d-8e98-4c916af3175d
Number of sequences: 25
Total residues     : 3095693983
Minimum length     : 16571
Maximum length     : 249250621

Build the RTG index files from the paired fastq read files

Now the turn for the reads which will also be converted to a hash. This is unusual with other programs which take fastq as input while RTG also encodes the reads for speed improvement.

rtg format \
    -f fastq -q sanger \
    -l reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz \
    -r reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz \
    --sam-rg="@RG\tID:NA18507\tSM:GAIIx-chr21-RTG_3.6.1\tPL:ILLUMINA" \
    -o hg19_rtg-mapping/NA18507_reads

index building output

Formatting paired-end FASTQ data
Processing left arm "reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz"
Processing right arm "reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz"
Too many sequences to check for duplicate sequence names.
 
Input Data
Files              : shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz
Format             : FASTQ
Type               : DNA
Number of pairs    : 7495097
Number of sequences: 14990194
Total residues     : 1499019400
Minimum length     : 100
Maximum length     : 100
 
Output Data
SDF-ID             : 7f8c9245-f353-41cc-ab67-83ea559b3bce
Number of pairs    : 7495097
Number of sequences: 14990194
Total residues     : 1499019400
Minimum length     : 100
Maximum length     : 100

Map reads to the reference

The mapping command can be tuned to fit specific needs, we use here default parameters which are often good enough for a diploid genome and the kind of application described in this training.

rtg map command arguments

Usage: rtg map [OPTION]... -o DIR -t SDF -i SDF|FILE
               [OPTION]... -o DIR -t SDF -l FILE -r FILE
 
Aligns sequence reads onto a reference template, creating an alignments file in the Sequence Alignment/Map
(SAM) format.
 
File Input/Output
  -F, --format=FORMAT              input format for reads (Must be one of [sdf, fasta, fastq, sam-se,
                                   sam-pe]) (Default is sdf)
  -i, --input=SDF|FILE             input read set
  -l, --left=FILE                  left input file for FASTA/FASTQ paired end reads
      --no-merge                   output mated/unmated/unmapped alignments into separate SAM/BAM files
  -o, --output=DIR                 directory for output
  -q, --quality-format=FORMAT      format of quality data for fastq files (use sanger for Illumina 1.8+)
                                   (Must be one of [sanger, solexa, illumina])
  -r, --right=FILE                 right input file for FASTA/FASTQ paired end reads
      --sam                        output the alignment files in SAM format
  -t, --template=SDF               SDF containing template to map against
 
Sensitivity Tuning
      --aligner-band-width=FLOAT   aligner indel band width scaling factor, fraction of read length allowed
                                   as an indel (Default is 0.5)
      --aligner-mode=STRING        pick the aligner to use (Must be one of [auto, table, general]) (Default
                                   is auto)
      --bed-regions=FILE           restrict calibration to mappings falling within the supplied BED regions
      --gap-extend-penalty=INT     penalty for a gap extension during alignment (Default is 1)
      --gap-open-penalty=INT       penalty for a gap open during alignment (Default is 19)
  -c, --indel-length=INT           guaranteed number of positions that will be detected in a single indel
                                   (Default is 1)
  -b, --indels=INT                 guaranteed minimum number of indels which will be detected (Default is
                                   1)
  -M, --max-fragment-size=INT      maximum permitted fragment size when mating paired reads (Default is
                                   1000)
  -m, --min-fragment-size=INT      minimum permitted fragment size when mating paired reads (Default is 0)
      --mismatch-penalty=INT       penalty for a mismatch during alignment (Default is 9)
  -d, --orientation=STRING         orientation for proper pairs (Must be one of [fr, rf, tandem, any])
                                   (Default is any)
      --pedigree=FILE              genome relationships pedigree containing sex of sample
      --repeat-freq=INT            maximum repeat frequency (Default is 90%)
      --sex=SEX                    sex of sample (Must be one of [male, female, either])
      --soft-clip-distance=INT     soft clip alignments if indels occur INT bp from either end (Default is
                                   5)
  -s, --step=INT                   step size (Default is word size)
  -a, --substitutions=INT          guaranteed minimum number of substitutions which will be detected
                                   (Default is 1)
      --unknowns-penalty=INT       penalty for unknown nucleotides during alignment (Default is 5)
  -w, --word=INT                   word size (Default is 22, or read length / 2, whichever is smaller)
 
Filtering
      --end-read=INT               exclusive upper bound on read id
      --start-read=INT             inclusive lower bound on read id
 
Reporting
      --all-hits                   output all alignments meeting thresholds instead of applying mating and
                                   N limits
      --max-mated-mismatches=INT   maximum mismatches for mappings across mated results, alias for
                                   --max-mismatches (as absolute value or percentage of read length)
                                   (Default is 10%)
  -e, --max-mismatches=INT         maximum mismatches for mappings in single-end mode (as absolute value or
                                   percentage of read length) (Default is 10%)
  -n, --max-top-results=INT        maximum number of top equal results output per read (Default is 5)
  -E, --max-unmated-mismatches=INT maximum mismatches for mappings of unmated results (as absolute value or
                                   percentage of read length) (Default is 10%)
      --no-unmapped                do not output unmapped reads
      --no-unmated                 do not output unmated reads when in paired-end mode
      --sam-rg=STRING|FILE         file containing a single valid read group SAM header line or a string in
                                   the form "@RG\tID:READGROUP1\tSM:BACT_SAMPLE\tPL:ILLUMINA"
      --top-random                 output a single random top hit per read
 
Utility
  -h, --help                       print help on command-line flag usage
      --legacy-cigars              use legacy cigars in output
      --no-calibration             do not produce calibration files
  -Z, --no-gzip                    do not gzip the output
      --no-index                   do not produce indexes for output files
      --no-svprep                  do not perform structural variant processing
      --read-names                 use read name in output instead of read id (Uses more RAM)
      --tempdir=DIR                directory used for temporary files (Defaults to output directory)
  -T, --threads=INT                number of threads (Default is the number of available cores)
rtg map -i hg19_rtg-mapping/NA18507_reads -o hg19_rtg-mapping/mapping-results -t $RTG_INDEXES/HiSeq_UCSC_hg19

rtg map output

ARM MAPPINGS
    left    right     both
 6877822  6877822 13755644  91.8% mated uniquely (NH = 1)
   10938    10938    21876   0.1% mated ambiguously (NH > 1)
  262904   164976   427880   2.9% unmated uniquely (NH = 1)
    6167     5480    11647   0.1% unmated ambiguously (NH > 1)
       0        0        0   0.0% unmapped due to read frequency (XC = B)
   46318    51570    97888   0.7% unmapped with no matings but too many hits (XC = C)
   28596    70149    98745   0.7% unmapped with poor matings (XC = d)
     129      141      270   0.0% unmapped with too many matings (XC = e)
   31316    33709    65025   0.4% unmapped with no matings and poor hits (XC = D)
       0        0        0   0.0% unmapped with no matings and too many good hits (XC = E)
  230907   280312   511219   3.4% unmapped with no hits
 7495097  7495097 14990194 100.0% total
RTG mapping plots]
Alignment-Score-Distribution-(Mated-Only)
Alignment-Score-Distribution-(Mated-Only).png
Alignment-Score-Distribution-(Unmated-Only)
Alignment-Score-Distribution-(Unmated-Only).png
Fragment-Length-Distribution
Fragment-Length-Distribution.png
Mapping-Quality-(MAPQ)-Distribution
Mapping-Quality-(MAPQ)-Distribution.png

Call variants by comparing mappings to the reference genome sequence

The many RTG applications comprise:

  • calling SNV variants with rtg snp
  • preparing structural variant analysis with rtg svprep
  • calling structural variants with rtg sv
  • finding chromosomal breakpoints with rtg discord
  • analysis of coverage with rtg coverage
  • call CNV with rtg cnv

Other type of analysis are present in RTG and we strongly suggest to look at them, they include somatic, family, population, and many more...

We will only apply here the first tool to obtain calls for snps and short indels. We apply default parameters to that run too. Please look at the many possible additional arguments below.

rtg snp command arguments

Usage: rtg snp [OPTION]... -o DIR -t SDF FILE+
               [OPTION]... -o DIR -t SDF -I FILE
 
Calls sequence variants, such as single nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs) and Indels, from a
set of alignments reported in co-ordinate sorted SAM/BAM files.
 
File Input/Output
  -I, --input-list-file=FILE          file containing a list of SAM/BAM format files (1 per line) containing mapped reads
      --no-calibration                if set, ignore mapping calibration files
  -o, --output=DIR                    directory for output
  -t, --template=SDF                  SDF of the reference genome the reads have been mapped against
      FILE+                           SAM/BAM format files containing mapped reads. May be specified 0 or more times
 
Sensitivity Tuning
      --bed-regions=FILE              BED file containing regions to process
      --exclude-mated                 exclude all mated SAM records
      --exclude-unmated               exclude all unmated SAM records
      --keep-duplicates               don't detect and filter duplicate reads based on mapping position
  -m, --machine-errors=STRING         if set, force sequencer machine settings. One of [default, illumina, ls454_se, ls454_pe,
                                      complete, iontorrent]
      --max-as-mated=INT              if set, ignore mated SAM records with an alignment score (AS attribute) that exceeds this
                                      value
      --max-as-unmated=INT            if set, ignore unmated SAM records with an alignment score (AS attribute) that exceeds this
                                      value
      --max-coverage=INT              skip calling in sites with combined depth exceeding this fixed value (Default is 200)
      --max-coverage-multiplier=FLOAT skip calling in sites with combined depth exceeding multiplier * average combined coverage
                                      determined from calibration (Default is 5.0)
      --max-hits=INT                  if set, ignore SAM records with an alignment count that exceeds this value
      --min-mapq=INT                  if set, ignore SAM records with MAPQ less than this value
  -p, --pedigree=FILE                 genome relationships PED file containing sex of individual
      --ploidy=STRING                 ploidy to use when the template does not contain a reference text file (Must be one of
                                      [diploid, haploid]) (Default is diploid)
      --population-priors=FILE        if set, use the VCF file to generate population based site-specific priors
      --rdefault-mated=INT            for mated reads that have no mapping quality supplied use this as the default quality (in
                                      Phred format from 0 to 63) (Default is 20)
      --rdefault-unmated=INT          for unmated reads that have no mapping quality supplied use this as the default quality (in
                                      Phred format from 0 to 63) (Default is 20)
      --region=STRING                 if set, only process SAM records within the specified range. The format is one of
                                      <template_name>, <template_name>:start-end or <template_name>:start+length
      --sex=SEX                       sex of individual (Must be one of [male, female, either]) (Default is either)
 
Reporting
  -a, --all                           write variant calls covering every position irrespective of thresholds
      --avr-model=FILE                name of AVR model to use when scoring variants (Must be one of [none, alternate.avr,
                                      illumina-exome.avr, illumina-wgs.avr] or a path to a model file) (Default is
                                      illumina-exome.avr)
      --filter-ambiguity=INT          threshold for ambiguity filter applied to output variants
      --filter-bed=FILE               apply a position based filter, retaining only variants that fall in these BED regions
      --filter-depth=INT              apply a fixed depth of coverage filter to output variants
      --filter-depth-multiplier=FLOAT apply a ratio based depth filter. The filter will be multiplier * average coverage determined
                                      from calibration files
      --min-avr-score=FLOAT           if set, fail variants with AVR scores below this value
      --snps-only                     if set, will output simple SNPs only
 
Utility
  -h, --help                          print help on command-line flag usage
  -Z, --no-gzip                       do not gzip the output
      --no-index                      do not produce indexes for output files
  -T, --threads=INT                   number of threads (Default is the number of available cores)
rtg snp -t $RTG_INDEXES/HiSeq_UCSC_hg19 -o hg19_rtg-mapping/snp-results hg19_rtg-mapping/mapping-results/alignments.bam

rtg snp output

Failed Filters               : 78
Excessive Coverage           : 1651
No Hypotheses                : 59
Passed Filters               : 106685
Complex Called               : 27697
SNPs                         : 80247
MNPs                         : 1816
Insertions                   : 6111
Deletions                    : 6884
Indels                       : 2607
Same as reference            : 9020
SNP Transitions/Transversions: 1.91 (72552/38028)
Total Het/Hom ratio          : 1.74 (62083/35582)
SNP Het/Hom ratio            : 1.66 (50034/30213)
MNP Het/Hom ratio            : 0.76 (782/1034)
Insertion Het/Hom ratio      : 2.07 (4118/1993)
Deletion Het/Hom ratio       : 2.92 (5128/1756)
Indel Het/Hom ratio          : 3.45 (2021/586)
Insertion/Deletion ratio     : 0.89 (6111/6884)
Indel/SNP+MNP ratio          : 0.19 (15602/82063)

In addition, a region file is provided in BED format that details the full genome in regions of reference calls, no calls and variant calls which are quite unique and very valuable.

rtg region details in the first part of chr21

chr21   9464400 9464408 complex-no-variant
chr21   9465822 9465832 complex-no-variant
chr21   9467415 9467417 complex-called
chr21   9467596 9467605 complex-called
chr21   9473185 9473193 complex-called
chr21   9475390 9475398 complex-called
chr21   9475843 9475846 complex-no-variant
chr21   9476539 9476548 complex-called
chr21   9476619 9476621 complex-called
chr21   9477001 9477016 complex-called
chr21   9477046 9477053 complex-called
chr21   9477344 9477349 complex-called
chr21   9477442 9477449 complex-called
chr21   9477635 9477645 complex-called
chr21   9477660 9477669 complex-called
chr21   9477953 9477955 complex-called
chr21   9478023 9478027 complex-called
chr21   9478287 9478289 complex-called
chr21   9478357 9478358 complex-no-variant
chr21   9478562 9478565 complex-called
chr21   9478714 9478717 complex-called
...
 
## the data is for chr21 divided into:
  19476 complex-called
     57 complex-no-hypotheses
   1675 complex-no-variant
      5 complex-over-coverage

download exercise files

Download exercise files here

Use the right application to open the files present in ex3e-files

References:
  1. http://www.realtimegenomics.com/

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d ]