NGS-Var2018 Exercise.2

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.1 | NGS-Var2018 Exercise.3 ]


Align paired end reads to the human reference genome GRCh38 using the Burrow Wheeler Aligner (BWA)


ex02_wf.png

Introduction

Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) is(are) stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.

Handicon.png Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).

prepare the reference genome for BWA alignment

BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands.

This step was performed for you and the reference index saved to the GenePattern server under the name GRCh38.

Handicon.png Nine takeways about GRCh38 from a Broad page

Align the reads in pairs to the reference genome using the bwa mem algorithm

Handicon.png We will do this step using the smallest 1% sample and not the full data in order to speed up the process

  • start the BWA mem module
  • in the 'input' parameter group, link the reference and the two 1% read files (0.01) in the corresponding fields
ex2_001.png
  • review the optional settings but do not change the defaults

use the text below to fill the text fields:

@RG ID:HG001 LB:NA12878_giab PU:unknown-0.0 PL:Illumina SM:NA12878
HG001_1pc
  • run and wait for results, you should get a job as shown next
ex2_03.png

Prepare the mappings for variant calling with various Picard tools grouped in GATK.BamPreprocess

The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated Picard tool.

Technical.png You can read more about read duplicates in our [Q&A section about read duplicates]

Technical.png Read Name regex: Reads are expected to be named in the following convention machine:run:flowcell:lane:tile:x_coord:y_coord

  • start the GATK.BamPreprocess module
  • point the tool to the SAM file obtained after BWA mapping then review and adapt the other fields as shown
ex2_04.png
ex2_05.png
  • run and wait for the results
ex2_06.png
  • open the 'HG001_1pc.duplicatesmetrics.txt' summary text file to get numbers

Technical.png The metrics are explained HERE

Results for the 1% run

## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates  --INPUT HG001_1pc.querysorted.bam --OUTPUT HG001_1pc.marked.bam --METRICS_FILE HG001_1pc.duplicatesmetrics.txt --ASSUME_SORT_ORDER queryname --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --TMP_DIR TMP --QUIET true --VALIDATION_STRINGENCY SILENT  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Oct 10 15:39:55 CEST 2018
 
## METRICS CLASS	picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
NA12878_giab	2284	340118	2599	2284	34	20	0	0.000108	2891892954
 
## HISTOGRAM	java.lang.Double
BIN	VALUE
1.0	1
2.0	1.999882
3.0	2.999647
4.0	3.999294
5.0	4.998824
6.0	5.998236
7.0	6.997531
8.0	7.996708
9.0	8.995767
10.0	9.994709
11.0	10.993534
12.0	11.992241
13.0	12.990831
14.0	13.989303
15.0	14.987658
16.0	15.985895
17.0	16.984015
18.0	17.982018
19.0	18.979903
20.0	19.977671
21.0	20.975322
22.0	21.972855
23.0	22.970271
24.0	23.967569
25.0	24.964751
26.0	25.961815
27.0	26.958761
28.0	27.955591
29.0	28.952303
30.0	29.948898
31.0	30.945376
32.0	31.941737
33.0	32.937981
34.0	33.934107
35.0	34.930116
36.0	35.926008
37.0	36.921783
38.0	37.917441
39.0	38.912982
40.0	39.908405
41.0	40.903712
42.0	41.898902
43.0	42.893974
44.0	43.88893
45.0	44.883768
46.0	45.87849
47.0	46.873094
48.0	47.867582
49.0	48.861952
50.0	49.856206
51.0	50.850343
52.0	51.844362
53.0	52.838265
54.0	53.832051
55.0	54.82572
56.0	55.819273
57.0	56.812708
58.0	57.806027
59.0	58.799229
60.0	59.792314
61.0	60.785282
62.0	61.778133
63.0	62.770868
64.0	63.763486
65.0	64.755987
66.0	65.748371
67.0	66.740639
68.0	67.73279
69.0	68.724824
70.0	69.716742
71.0	70.708543
72.0	71.700228
73.0	72.691795
74.0	73.683246
75.0	74.674581
76.0	75.665799
77.0	76.6569
78.0	77.647885
79.0	78.638754
80.0	79.629505
81.0	80.620141
82.0	81.610659
83.0	82.601062
84.0	83.591347
85.0	84.581517
86.0	85.57157
87.0	86.561506
88.0	87.551326
89.0	88.54103
90.0	89.530617
91.0	90.520088
92.0	91.509442
93.0	92.49868
94.0	93.487802
95.0	94.476808
96.0	95.465697
97.0	96.45447
98.0	97.443126
99.0	98.431667
100.0	99.420091

Handicon.png You can make the top table (## METRICS CLASS section) much more readable by 'transposing it' under unix or with this nice webtool: https://www.browserling.com/tools/text-transpose

LIBRARY NA12878_giab
UNPAIRED_READS_EXAMINED 2284
READ_PAIRS_EXAMINED 340118
SECONDARY_OR_SUPPLEMENTARY_RDS 2599
UNMAPPED_READS 2284
UNPAIRED_READ_DUPLICATES 34
READ_PAIR_DUPLICATES 20
READ_PAIR_OPTICAL_DUPLICATES 0
PERCENT_DUPLICATION 0.000108
ESTIMATED_LIBRARY_SIZE 2891892954

The covariate analysis results can be reviewed in the HG001_1pc.covariates.pdf and HG001_10pc.covariates.pdf PDF files.

download exercise files

Download exercise files here

Use the right application to open the files present in ex2-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.1 | NGS-Var2018 Exercise.3 ]