NGS-Var2020 Exercise.2

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.1 | NGS-Var2020 Exercise.3 ]


Align paired end reads to the human reference genome GRCh38 using the Burrow Wheeler Aligner (BWA)


ex02_wf.png

Introduction

Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) is(are) stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.

Handicon.png Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).

prepare the reference genome for BWA alignment

BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands.

This step was performed for you and the reference index saved to the GenePattern server under the name GRCh38.

Handicon.png Nine takeways about GRCh38 from a Broad page

Align the reads in pairs to the reference genome using the bwa mem algorithm

Handicon.png We will do this step using the smallest 1% sample and not the full data in order to speed up the process

  • start the BWA mem module
  • in the 'input' parameter group, link the reference and the two 1% read files (0.01) in the corresponding fields
ex2_001.png
  • review the optional settings but do not change the defaults

use the text below to fill the text fields:

https://data.bits.vib.be/pub/trainingen/NGSVAR2020/reads/HG001.GRCh38_chr22_0.01_1.fq.gz
https://data.bits.vib.be/pub/trainingen/NGSVAR2020/reads/HG001.GRCh38_chr22_0.01_2.fq.gz
@RG ID:HG001 LB:NA12878_giab PU:unknown-0.0 PL:Illumina SM:NA12878
HG001_1pc
  • run and wait for results, you should get a job as shown next
ex2_03.png

Prepare the mappings for variant calling with various Picard tools grouped in GATK.BamPreprocess

The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated Picard tool.

Technical.png You can read more about read duplicates in our [Q&A section about read duplicates]

Technical.png Read Name regex: Reads are expected to be named in the following convention machine:run:flowcell:lane:tile:x_coord:y_coord

  • start the GATK.BamPreprocess module
  • point the tool to the SAM file obtained after BWA mapping then review and adapt the other fields as shown
ex2_04.png
ex2_05.png
  • run and wait for the results
ex2_06.png
  • open the 'HG001_1pc.duplicatesmetrics.txt' summary text file to get numbers

Technical.png The metrics are explained HERE

Results for the 1% run

## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates  --INPUT NA12878_0.01_1.fq.querysorted.bam --OUTPUT NA12878_0.01_1.fq.marked.bam --METRICS_FILE NA12878_0.01_1.fq.duplicatesmetrics.txt --ASSUME_SORT_ORDER queryname --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --TMP_DIR TMP --QUIET true --VALIDATION_STRINGENCY SILENT  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
## htsjdk.samtools.metrics.StringHeader
# Started on: Mon Dec 30 09:24:45 CET 2019
 
## METRICS CLASS	picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
NA12878_01pc	2284	340118	2599	2284	34	20	0	0.000108	2891892954
 
## HISTOGRAM	java.lang.Double
BIN	CoverageMult	all_sets	non_optical_sets
1.0	1	340078	340078
2.0	1.999882	20	20
3.0	2.999647	0	0
4.0	3.999294	0	0
5.0	4.998824	0	0
6.0	5.998236	0	0
7.0	6.997531	0	0
8.0	7.996708	0	0
9.0	8.995767	0	0
10.0	9.994709	0	0
11.0	10.993534	0	0
12.0	11.992241	0	0
13.0	12.990831	0	0
14.0	13.989303	0	0
15.0	14.987658	0	0
16.0	15.985895	0	0
17.0	16.984015	0	0
18.0	17.982018	0	0
19.0	18.979903	0	0
20.0	19.977671	0	0
21.0	20.975322	0	0
22.0	21.972855	0	0
23.0	22.970271	0	0
24.0	23.967569	0	0
25.0	24.964751	0	0
26.0	25.961815	0	0
27.0	26.958761	0	0
28.0	27.955591	0	0
29.0	28.952303	0	0
30.0	29.948898	0	0
31.0	30.945376	0	0
32.0	31.941737	0	0
33.0	32.937981	0	0
34.0	33.934107	0	0
35.0	34.930116	0	0
36.0	35.926008	0	0
37.0	36.921783	0	0
38.0	37.917441	0	0
39.0	38.912982	0	0
40.0	39.908405	0	0
41.0	40.903712	0	0
42.0	41.898902	0	0
43.0	42.893974	0	0
44.0	43.88893	0	0
45.0	44.883768	0	0
46.0	45.87849	0	0
47.0	46.873094	0	0
48.0	47.867582	0	0
49.0	48.861952	0	0
50.0	49.856206	0	0
51.0	50.850343	0	0
52.0	51.844362	0	0
53.0	52.838265	0	0
54.0	53.832051	0	0
55.0	54.82572	0	0
56.0	55.819273	0	0
57.0	56.812708	0	0
58.0	57.806027	0	0
59.0	58.799229	0	0
60.0	59.792314	0	0
61.0	60.785282	0	0
62.0	61.778133	0	0
63.0	62.770868	0	0
64.0	63.763486	0	0
65.0	64.755987	0	0
66.0	65.748371	0	0
67.0	66.740639	0	0
68.0	67.73279	0	0
69.0	68.724824	0	0
70.0	69.716742	0	0
71.0	70.708543	0	0
72.0	71.700228	0	0
73.0	72.691795	0	0
74.0	73.683246	0	0
75.0	74.674581	0	0
76.0	75.665799	0	0
77.0	76.6569	0	0
78.0	77.647885	0	0
79.0	78.638754	0	0
80.0	79.629505	0	0
81.0	80.620141	0	0
82.0	81.610659	0	0
83.0	82.601062	0	0
84.0	83.591347	0	0
85.0	84.581517	0	0
86.0	85.57157	0	0
87.0	86.561506	0	0
88.0	87.551326	0	0
89.0	88.54103	0	0
90.0	89.530617	0	0
91.0	90.520088	0	0
92.0	91.509442	0	0
93.0	92.49868	0	0
94.0	93.487802	0	0
95.0	94.476808	0	0
96.0	95.465697	0	0
97.0	96.45447	0	0
98.0	97.443126	0	0
99.0	98.431667	0	0
100.0	99.420091	0	0

Handicon.png You can make the top table (## METRICS CLASS section) much more readable by 'transposing it' under unix or with this nice webtool: https://www.browserling.com/tools/text-transpose

LIBRARY NA12878_giab
UNPAIRED_READS_EXAMINED 229105
READ_PAIRS_EXAMINED 33980150
SECONDARY_OR_SUPPLEMENTARY_RDS 262641
UNMAPPED_READS 229109
UNPAIRED_READ_DUPLICATES 141832
READ_PAIR_DUPLICATES 151566
READ_PAIR_OPTICAL_DUPLICATES 3728
PERCENT_DUPLICATION 0.006525
ESTIMATED_LIBRARY_SIZE 3892930596

The covariate analysis results can be reviewed in the HG001_1pc.covariates.pdf and HG001_10pc.covariates.pdf PDF files.

download exercise files

Download exercise files here

Use the right application to open the files present in ex2-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.1 | NGS-Var2020 Exercise.3 ]