NGS Exercise.1

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 ]


Download Illumina data - perform BAM QC - correct SAM/BAM formatting errors

This exercise was removed from the session program but is left to teach you how to recycle public NGS data from the internet.


ex01_wf.png

Introduction

In this first session, we will inspect the BAM data and perform a number of controls to detect formatting errors and fix them when pssible.

The Illumina shared 'True Seq' resources page [1] provides various datasets and reference files used for this training. Obtaining data and installing tools is not covered in this session. If you wish to reproduce the environment available on your laptop today, please contact us.

The chr21 data was obtained from the ftp link [2]. This file is about 1GB in size and provides all reads mapped to chr21 (hg18) using the Illumina CASAVA suite in BAM format. As we will see in this exercise, the Illumina BAM file presents serious formatting issues that will be interesting to evaluate and fix during this session and get a general idea of this kind of maintenance tasks.

The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). A lot of information is available for that sample from the Coriell repository where the gDNA is archived [3].

Cli tools.png Precomputed results of this exercise have been moved to Final_Results/ori-Illumina_hg18_mapping_results/

download the original bam file in terminal

The following command will download the data in the current folder.

Error creating thumbnail: Unable to save thumbnail to destination
This was already done for you so no need to run this command during the session
infolder="ori-Illumina_hg18_mapping"
mkdir -p ${infolder}
 
# wget allows downloading files from the internet
# when you have the full URL to a file, you can download it (even if it is on a FTP server, read the 'man wget' pages)
wget -P ${infolder} \
 ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam

what did we learn here?

Cli tools.png Using -p after mkdir allows:

  • creating a folder when it does not exist yet
  • create subfolder in that folder in one go (mkdir -p first/second/third-level where 'first', 'second', and 'third-level' are created, nested in each other)
  • leave existing folder and/or subfolder unchanged if they already existed ('-p' this will not trigger errors when trying to create an already existing folder

inspect BAM content

While SAM data can be inspected using basic unix commands (head, less, grep, ...), BAM format instead is compressed and requires a specialized parser to read the file. Two broadly used tools to view BAM data are samtools and its java derivative picard.

Error creating thumbnail: Unable to save thumbnail to destination
after converting data to plain text viewable on the terminal screen, you can pipe the result to any unix tool to the stream including head (used below), more, grep, ...

samtools comprises only few distinct tools collapsed here

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19+

Usage:   samtools <command> [options]

Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         bedcov      read depth per BED region
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
         bamshuf     shuffle and group alignments by name

picard, by contrast, comprises many more specific tools (below an overview for the current version)

## how to run Picard tools in recent versions ?
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>

## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar

view the top 5 lines of a bam file with samtools

infolder="ori-Illumina_hg18_mapping"
 
samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam | \
    head -5
 
samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam |     head -5
EAS51_0210:3:6:3797:7459        89      chr21   9719702 73      100M    *       0       0       ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC    DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34       SM:i:73 AS:i:0
EAS51_0210:3:6:3797:7459        165     chr21   9719702 255     *       *       0       0       AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT    GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC    H0:i:0  H1:i:0 H2:i:2   SM:i:-1 AS:i:0
EAS51_0210:7:33:5109:13959      145     chr21   9719707 254     100M    chrY    10653706        0       TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC    FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2  SM:i:461        AS:i:0
EAS25_0078:8:23:14907:11377     89      chr21   9719708 254     100M    *       0       0       TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA    EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3   SM:i:461        AS:i:0
EAS25_0078:8:23:14907:11377     165     chr21   9719708 255     *       *       0       0       CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF    H0:i:1  H1:i:6 H2:i:60  SM:i:-1 AS:i:0


what do you see in these lines?

  • the five lines refer to three read pairs but one pair has only one read => The second read is missing and has probably been removed by Illumina to generate this chr21 subset
  • the read names are identical within a pair => normally, paired reads end with /1 and /2 respectively

Error creating thumbnail: Unable to save thumbnail to destination
only read lines are shown and the header is excluded.
To also get the header, add -h after view. To see ONLY the header replace -h by -H
infolder="ori-Illumina_hg18_mapping"
 
# print only the header
samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam -H 
 
# print the header followed by ALL read lines (ie: the full BAM content !!!)
samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam -h | more

what did we learn here?

Cli tools.png samtools view allows:

  • viewing the BAM data on screen or send it to a pipe
  • also viewing the BAM header when using the '-h' argument
  • only viewing the header of the file when using the '-H' argument

view the same top 5 lines of a bam file with picard ViewSam

We added the @PG line of header seen with -H to the count to come to 6. VALIDATION_STRINGENCY=SILENT was also added to avoid getting error messages about inconsistent format.

infolder="ori-Illumina_hg18_mapping"
infile="NA18507_GAIIx_100_chr21.bam"
 
java -jar $PICARD/picard.jar ViewSam \
     I=${infolder}/${infile} \
     VALIDATION_STRINGENCY=SILENT \
     QUIET=TRUE \
     | head -6
 
@PG     ID:CASAVA       VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
EAS51_0210:3:6:3797:7459        89      chr21   9719702 73      100M    *       0       0       ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC    DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34       SM:i:73 AS:i:0
EAS51_0210:3:6:3797:7459        165     chr21   9719702 255     *       *       0       0       AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT    GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC    H0:i:0  H1:i:0 H2:i:2   SM:i:-1 AS:i:0
EAS51_0210:7:33:5109:13959      145     chr21   9719707 254     100M    chrY    10653706        0       TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC    FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2  SM:i:461        AS:i:0
EAS25_0078:8:23:14907:11377     89      chr21   9719708 254     100M    *       0       0       TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA    EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3   SM:i:461        AS:i:0
EAS25_0078:8:23:14907:11377     165     chr21   9719708 255     *       *       0       0       CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF    H0:i:1  H1:i:6 H2:i:60  SM:i:-1 AS:i:0
EAS25_0078:3:4:1830:9254        153     chr21   9719709 254     100M    *       0       0       GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC    EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4    SM:i:461        AS:i:0
Error creating thumbnail: Unable to save thumbnail to destination
using PICARD: add 'VALIDATION_STRINGENCY=LENIENT' to ignore formatting errors and 'QUIET=TRUE' to not output error messages

perform BAM quality control

get basic counts from a bam file with samtools

infolder="ori-Illumina_hg18_mapping"
 
samtools flagstat ${infolder}/NA18507_GAIIx_100_chr21.bam
 
15069635 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14165019 + 0 mapped (94.00%:nan%)
15069635 + 0 paired in sequencing
7534598 + 0 read1
7535037 + 0 read2
12911388 + 0 properly paired (85.68%:nan%)
13260403 + 0 with itself and mate mapped
904616 + 0 singletons (6.00%:nan%)
79441 + 0 with mate mapped to a different chr
79024 + 0 with mate mapped to a different chr (mapQ>=5)
Error creating thumbnail: Unable to save thumbnail to destination
samtools tells us that forward and reverse read counts are different!!
Error creating thumbnail: Unable to save thumbnail to destination
samtools tells us that 12'911'388 reads out of 15'069'635 are properly paired!! (85.68% io the expected 100%)

perform similar analysis with picard

Picard has several analysis tools to parse BAM data and collect metrics; several will be exemplified later today.

CollectAlignmentSummaryMetrics
CollectGcBiasMetrics
CollectInsertSizeMetrics
CollectMultipleMetrics
CollectRnaSeqMetrics
CollectTargetedPcrMetrics


what commands does Picard offer?

# You can get the full list of Picard commands with a dry run:
java -jar $PICARD/picard.jar

get a detailed BAM alignment summary with picard

This time, we add VALIDATION_STRINGENCY=LENIENT to get feedback about of format errors but without stopping on the first error.

infolder="ori-Illumina_hg18_mapping"
infile="NA18507_GAIIx_100_chr21.bam"
 
# create a folder for Your results
outfolder="ori-Illumina_hg18_mapping_results"
mkdir -p ${outfolder}
 
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics  \
	I=${infolder}/${infile} \
	O=${outfolder}/ori_CollectAlignmentSummaryMetrics.txt \
	VALIDATION_STRINGENCY=LENIENT
 
# to avoid such long scrolling text, we will later add an extra line
# that saves all error reports to a text file
# eg: 2>${outfolder}/ori_CollectAlignmentSummaryMetrics.err 
# '2>' is there to redirect error messages to ... the file defined by the remaining of the line
 
# ... a lot of error messages! then finally ...
cat ${outfolder}/ori_CollectAlignmentSummaryMetrics.txt

 

CATEGORY	TOTAL_READS	PF_READS	PCT_PF_READS	PF_NOISE_READS	PF_READS_ALIGNED	PCT_PF_READS_ALIGNED
PF_ALIGNED_BASES	PF_HQ_ALIGNED_READS	 PF_HQ_ALIGNED_BASES	PF_HQ_ALIGNED_Q20_BASES	PF_HQ_MEDIAN_MISMATCHES	PF_MISMATCH_RATE	
PF_HQ_ERROR_RATE	PF_INDEL_RATE	MEAN_READ_LENGTH	READS_ALIGNED_IN_PAIRS	PCT_READS_ALIGNED_IN_PAIRS	BAD_CYCLES	STRAND_BALANCE	
PCT_CHIMERAS	PCT_ADAPTER	SAMPLE	LIBRARY	READ_GROUP
FIRST_OF_PAIR	7534598	7534598	1	0	0	0	0	0	0	0	0	0	0	0	100	0	0	0
0	0	0.000011			
SECOND_OF_PAIR	7535037	7535037	1	0	0	0	0	0	0	0	0	0	0	0	100	0	0	0
0	0	0.00001			
PAIR	15069635	15069635	1	0	0	0	0	0	0	0	0	0	0	0	100	0	0
0	0	0	0.000011

The too many columns for few lines geometry makes that the result reads better after omitting comment lines, rotating and omitting the empty first column (admire the power of command line!?).

outfolder="ori-Illumina_hg18_mapping_results"
grep -v "^#" ${outfolder}/ori_CollectAlignmentSummaryMetrics.txt | \
	transpose -t | \
	cut -f 2- | \
	column -t
 
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 7534598        7535037         15069635
PF_READS                    7534598        7535037         15069635
PCT_PF_READS                1              1               1
PF_NOISE_READS              0              0               0
PF_READS_ALIGNED            0              0               0
PCT_PF_READS_ALIGNED        0              0               0
PF_ALIGNED_BASES            0              0               0
PF_HQ_ALIGNED_READS         0              0               0
PF_HQ_ALIGNED_BASES         0              0               0
PF_HQ_ALIGNED_Q20_BASES     0              0               0
PF_HQ_MEDIAN_MISMATCHES     0              0               0
PF_MISMATCH_RATE            0              0               0
PF_HQ_ERROR_RATE            0              0               0
PF_INDEL_RATE               0              0               0
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      0              0               0
PCT_READS_ALIGNED_IN_PAIRS  0              0               0
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0              0               0
PCT_CHIMERAS                0              0               0
PCT_ADAPTER                 0.000011       0.00001         0.000011
SAMPLE
LIBRARY
READ_GROUP

what did we learn here?

Cli tools.png Bash Goodies

  • grep -v "^#" excludes lines starting with '#'
  • 'cut -f 2-' keeps columns 2 and upwards and removes the first (empty) column
  • 'transpose -t' converts (many columns x few rows) TO (few columns x many rows) that read better
  • 'column -t ' normalizes the table with spaces to align the text in each column

As can be seen, not all numbers match between the two methods. For instance, it is clear from the counts that the number of reads in both sets is unequal, which suggests that some reads have been removed from either or both groups. This will later be confirmed when fixing the BAM data for reanalysis.

identify common BAM problems

The Illumina BAM file has several frustrating issues that we will now fix.

Picard provides a command to validate SAM/BAM data and report all errors with their count through the file. The command can be run in quick mode to return only the first N errors or fully to report a count table as shown below.

validate a SAM/BAM file and report the first 100 ERRORS

infolder="ori-Illumina_hg18_mapping"
infile="NA18507_GAIIx_100_chr21.bam"
 
outfolder="ori-Illumina_hg18_mapping_results"
 
# 'IGNORE_WARNINGS=TRUE' omits top comments in the result
 
##############
# first 100 errors and no warnings
java -jar $PICARD/picard.jar ValidateSamFile \
	I=${infolder}/${infile} \
	MO=100 \
        VALIDATION_STRINGENCY=LENIENT \
        IGNORE_WARNINGS=TRUE \
	O=${outfolder}/ori_ValidateSamFile_MO100.txt
 
# the result can be viewed with 
head ${outfolder}/ori_ValidateSamFile_MO100.txt
ERROR: Read groups is empty
ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mapped mate should have mate reference name
ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, MAPQ should be 0 for unmapped read.
ERROR: Record 1, Read name EAS51_0210:3:6:3797:7459, Mate alignment does not match alignment start of mate
ERROR: Record 1, Read name EAS51_0210:3:6:3797:7459, Mate reference index (MRNM) does not match reference index of mate
ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mate alignment does not match alignment start of mate
ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mate reference index (MRNM) does not match reference index of mate
ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, Mapped mate should have mate reference name
ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, MAPQ should be 0 for unmapped read.
ERROR: Record 4, Read name EAS25_0078:8:23:14907:11377, Mate alignment does not match alignment start of mate

From the first run, we often see:

  • MAPQ should be 0 for unmapped read.
  • Mapped mate should have mate reference name

We can build a more clever command to get a count of different errors found in the first part of the file. This is another demonstration of the power of the command line and piping commands

infolder="ori-Illumina_hg18_mapping_results"
infile=ori_ValidateSamFile_MO100.txt
 
# ignore the line indicating the limit at 100 errors
# filter with gawk and keep the last column of each 'WARNING' or 'ERROR' line ('$NF')
# sort the output to put similar lines together ('sort') and count each group ('uniq -c')
# count errors and summarize
 
grep -v "Maximum output of" ${infolder}/${infile} | \
    gawk 'BEGIN{FS=", ";OFS="\t"}{print $NF}' | \
    sort | \
    uniq -c
 
      1 ERROR: Read groups is empty
     17 Mapped mate should have mate reference name
     17 MAPQ should be 0 for unmapped read.
     33 Mate alignment does not match alignment start of mate
     32 Mate reference index (MRNM) does not match reference index of mate

This confirms the previous inspection with addition of one ERROR: Read groups is empty

validate a SAM/BAM file and report all ERRORS with counts

Replacing MO=100 by M=SUMMARY allows reporting the counts of all errors in the file.

infolder="ori-Illumina_hg18_mapping"
infile="NA18507_GAIIx_100_chr21.bam"
 
outfolder="ori-Illumina_hg18_mapping_results"
 
java -jar $PICARD/picard.jar ValidateSamFile \
	I=${infolder}/${infile} \
	M=SUMMARY \
        VALIDATION_STRINGENCY=LENIENT \
        IGNORE_WARNINGS=TRUE \
	O=${outfolder}/Illumina_ValidateSamFile_summary.txt

 

# inspect results
cat ${outfolder}/Illumina_ValidateSamFile_summary.txt
 
## HISTOGRAM	java.lang.String
Error Type	Count
ERROR:CIGAR_MAPS_OFF_REFERENCE	154
ERROR:INVALID_ALIGNMENT_START	2
ERROR:INVALID_FLAG_MATE_UNMAPPED	904616
ERROR:INVALID_MAPPING_QUALITY	904616
ERROR:MATE_NOT_FOUND	79441
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND	453969
ERROR:MISMATCH_MATE_ALIGNMENT_START	1809232
ERROR:MISMATCH_MATE_REF_INDEX	1809232
ERROR:MISSING_READ_GROUP	1


correct common BAM problems

fix incorrect mate information in the PE BAM file

Some providers or software produce incorrect mate pair information in their output. This picard tool parses the full BAM content and corrects this error. Do not forget to keep the 'VALIDATION_STRINGENCY=LENIENT' parameter to allow picard processing of your file in the presence of other issues.

# all picard commands are 'LENIENT' to continue error
infolder="ori-Illumina_hg18_mapping"
infile="NA18507_GAIIx_100_chr21.bam"
 
outfolder="ori-Illumina_hg18_mapping_results"
 
# FixMateInformation expects input in name order 
#  it will first resort the reads using picard internal sorting command then proceed
# we ask the output to be kept name-sorted
 
java -jar $PICARD/picard.jar FixMateInformation \
	I=${infolder}/${infile} \
	O=${outfolder}/nmsrt_fixmate_${infile} \
	SO=queryname \
	VALIDATION_STRINGENCY=LENIENT \
	2>${outfolder}/FixMateInformation.err

If we inspect the error log and kept the first 10 lines of this otherwise very long file we notice the resorting step on line #3

head -10 ${outfolder}/FixMateInformation.err
 
[Thu Feb 18 09:22:09 CET 2016] picard.sam.FixMateInformation INPUT=[ori-Illumina_hg18_mapping/NA18507_GAIIx_100_chr21.bam] OUTPUT=ori-Illumina_hg18_mapping_results/nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam SORT_ORDER=queryname VALIDATION_STRINGENCY=LENIENT    ASSUME_SORTED=false ADD_MATE_CIGAR=true IGNORE_MISSING_MATES=true VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Feb 18 09:22:09 CET 2016] Executing as splaisan@r710bits on Linux 3.10.0-327.4.5.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater
INFO    2016-02-18 09:22:09     FixMateInformation      Sorting input into queryname order.
Ignoring SAM validation error: ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 7, Read name EAS25_0078:3:4:1830:9254, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 7, Read name EAS25_0078:3:4:1830:9254, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 9, Read name EAS25_0078:7:72:6046:4654, Mapped mate should have mate reference name

Replacement command using samtools fixmate

Samtools also has a fixmate command that can be used as follows.
  • when using the more recent samtools htslib, add '-o' before the output name
  • the syntax shown below is for the older samtools v0.1.19
Error creating thumbnail: Unable to save thumbnail to destination
Note that before applying samtools fixmate, reads need first be sorted by name
infolder="ori-Illumina_hg18_mapping"
infile="NA18507_GAIIx_100_chr21.bam"
 
outfolder="ori-Illumina_hg18_mapping_results"
 
# first sorting by read names
samtools sort -n ${infolder}/${infile} \
	${outfolder}/st_nmsrt_${infile%%.bam} \
	2>${outfolder}/samtools_Sort.err
 
## when using the more recent samtools htslib, add '-o' before the base name of the output file
# samtools sort -n ${infolder}/${infile} \
# 	-o ${outfolder}/st_nmsrt_${infile%%.bam} \
# 	2>${outfolder}/samtools_Sort.err
 
# now fixing mate information
samtools fixmate ${outfolder}/st_nmsrt_${infile} \
	${outfolder}/st_fxmt_nmsrt_${infile} \
	2>${outfolder}/samtools_FixMateInformation.err

add or replace absent or incorrect 'read groups' in a BAM file

Read groups abbreviated "@RG" is a very important annotation in a SAM/BAM header as it defines the provenance of the reads and will be required when comparing or mixing reads between samples. Some software or providers do not add this line to the SAM header while other software (GATK) do require it. This picard script adds a defined read group to a BAM file to correct this issue. A similar command can be built with samtools and a short manually-edited and corrected header stored in a text file as template.

infolder=ori-Illumina_hg18_mapping_results
infile=nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam
 
outfolder="ori-Illumina_hg18_mapping_results"
outfile=rg_${infile}
 
## AddOrReplaceReadGroups
ID="NA18507"
LB="lib-NA18507"
PU="CASAVA_hg18"
PL="ILLUMINA"
SM="GAIIx_100_chr21"
 
java -jar $PICARD/picard.jar AddOrReplaceReadGroups \
	I=${infolder}/${infile} \
	O=${outfolder}/${outfile} \
	RGID=${ID} \
	RGLB=${LB} \
	RGPL=${PL} \
	RGPU=${PU} \
	RGSM=${SM} \
	VALIDATION_STRINGENCY=LENIENT \
	2>${outfolder}/AddOrReplaceReadGroups.err
 
# Look at the newly added line
samtools view -H ${outfolder}/${outfile}

The remaining errors logged in AddOrReplaceReadGroups.err are mostly MAPQ should be 0 for unmapped read.. This 'pseudo'-error is largely commented on fora like Biostar[4] and can safely be ignored.

The header now includes a @RG line required by many tools dedicated to SAM / BAM data.

@HD     VN:1.4  SO:queryname
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
@SQ     SN:chr3 LN:199501827
@SQ     SN:chr4 LN:191273063
@SQ     SN:chr5 LN:180857866
@SQ     SN:chr6 LN:170899992
@SQ     SN:chr7 LN:158821424
@SQ     SN:chrX LN:154913754
@SQ     SN:chr8 LN:146274826
@SQ     SN:chr9 LN:140273252
@SQ     SN:chr10        LN:135374737
@SQ     SN:chr11        LN:134452384
@SQ     SN:chr12        LN:132349534
@SQ     SN:chr13        LN:114142980
@SQ     SN:chr14        LN:106368585
@SQ     SN:chr15        LN:100338915
@SQ     SN:chr16        LN:88827254
@SQ     SN:chr17        LN:78774742
@SQ     SN:chr18        LN:76117153
@SQ     SN:chr19        LN:63811651
@SQ     SN:chr20        LN:62435964
@SQ     SN:chrY LN:57772954
@SQ     SN:chr22        LN:49691432
@SQ     SN:chr21        LN:46944323
@SQ     SN:chrM LN:16571
@RG     ID:NA18507      PL:ILLUMINA     PU:CASAVA_hg18  LB:lib-NA18507  SM:GAIIx_100_chr21
@PG     ID:CASAVA       VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16

remove un-paired reads from the Illumina chr21 subset data

We will finally fix the pairing problem in this bam file and remove all singleton reads with a custom script as there is no standard command to do this in either samtools or picard.

We also add a read pair index after read names in that script to better comply to the format standards.

Error creating thumbnail: Unable to save thumbnail to destination
For this step to succeed, the input file should be name-sorted (which was normally taken care of by Picard tools above)
infolder="ori-Illumina_hg18_mapping_results"
infile="rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam"
outfile="PE_${infile}"
 
# parse all reads and output correctly paired lines to a new bam file
samtools view -h ${infolder}/${infile} \
  | bam_re-pair.pl \
  | samtools view -bSo ${infolder}/${outfile} -
 
############################
# Results
# processed:        15069635
# passed-pairs:      7495097
# failed-reads:        79441

check the effect of removing singleton reads

The picard ValidateSamFile command with M=SUMMARY and the original data, is now reapplied to the cleaned data.

infolder="ori-Illumina_hg18_mapping_results"
infile="PE_rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam"
 
# 'IGNORE_WARNINGS=TRUE' to focus on real 'ERRORS' and ignore less hazardous 'WARNINGS'
 
##############
# M=SUMMARY
java -jar $PICARD/picard.jar ValidateSamFile \
	I=${infolder}/${infile} \
	M=SUMMARY \
        VALIDATION_STRINGENCY=LENIENT \
        IGNORE_WARNINGS=TRUE \
	O=${infolder}/PE_ValidateSamFile_Summary.txt
 
# the result can be viewed with 
cat ${infolder}/PE_ValidateSamFile_Summary.txt
 
## HISTOGRAM    java.lang.String
Error Type      Count
ERROR:CIGAR_MAPS_OFF_REFERENCE  278
ERROR:INVALID_MAPPING_QUALITY   904616
ERROR:MATE_NOT_FOUND    14990194

It is nice to see that we have indeed solved most of the original issues. What remains can be left as is.

get samtools statistics on correctly paired reads

infolder="ori-Illumina_hg18_mapping_results"
infile="PE_rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam"
 
# extract the name from the file-name (remove .bam)
name=${infile%%.bam}
 
samtools flagstat ${infolder}/${infile} > ${infolder}/${name}_flagstat.txt
 
# inspect with
cat ${infolder}/${name}_flagstat.txt
 
14990194 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14085578 + 0 mapped (93.97%:-nan%)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
12911388 + 0 properly paired (86.13%:-nan%)
13180962 + 0 with itself and mate mapped
904616 + 0 singletons (6.03%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

add @CO lines to the bam to log all data manipulations

Keeping track of what was done to your data is never a lost time. Picard tools can add the '@CO' line (also a @PG line) at run time when adding the parameter 'CO=' followed by a text describing the action (not done today but a good practice for you). Please read the help page of the Picard tools for more info about such options (java -jar $PICARD/picard.jar picard_tool_name -h)

some additional optional parameters shared by some but not all picard tools

PROGRAM_RECORD_ID=String
PG=String                     The program record ID for the @PG record(s) created by this program. Set to null to 
                              disable PG record creation.  This string may have a suffix appended to avoid collision 
                              with other program record IDs.  Default value: MarkDuplicates. This option can be set to 
                              'null' to clear the default value. 

PROGRAM_GROUP_VERSION=String
PG_VERSION=String             Value of VN tag of PG record to be created. If not specified, the version will be 
                              detected automatically.  Default value: null. 

PROGRAM_GROUP_COMMAND_LINE=String
PG_COMMAND=String             Value of CL tag of PG record to be created. If not supplied the command line will be 
                              detected automatically.  Default value: null. 

PROGRAM_GROUP_NAME=String
PG_NAME=String                Value of PN tag of PG record to be created.  Default value: MarkDuplicates. This option 
                              can be set to 'null' to clear the default value. 

COMMENT=String
CO=String                     Comment(s) to include in the output file's header.  This option may be specified 0 or 
                              more times.

There is no command to add a @CO header line out of the box but we can create a samtools piped line as follows.

Error creating thumbnail: Unable to save thumbnail to destination
Note the use of parentheses (...) to group the two commands creating the new header block followed by all reads.
infolder="ori-Illumina_hg18_mapping_results"
infile="PE_rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam"
outfile="PE_NA18507_GAIIx_100_chr21.bam"
 
# add comments for the record
costring='@CO\tPICARD (v1.101) FixMateInformation.jar and AddOrReplaceReadGroups
\n@CO\tbam_re-pair.pl (v1.0) to remove unpaired reads and keep only nice pairs'
 
# build a new header block on the fly and add all reads back to it
( samtools view -H ${infolder}/${infile}; \
    echo -e ${costring}; \
    samtools view ${infolder}/${infile} ) | \
    samtools view -Sb -o ${infolder}/${outfile} -

The final SAM header is now nicely GATK/PICARD compliant (note on the first @HD line the sort order now set to 'SO:queryname') .

samtools view -H ${infolder}/${outfile}
 
@HD     VN:1.4  SO:queryname
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
@SQ     SN:chr3 LN:199501827
@SQ     SN:chr4 LN:191273063
@SQ     SN:chr5 LN:180857866
@SQ     SN:chr6 LN:170899992
@SQ     SN:chr7 LN:158821424
@SQ     SN:chrX LN:154913754
@SQ     SN:chr8 LN:146274826
@SQ     SN:chr9 LN:140273252
@SQ     SN:chr10        LN:135374737
@SQ     SN:chr11        LN:134452384
@SQ     SN:chr12        LN:132349534
@SQ     SN:chr13        LN:114142980
@SQ     SN:chr14        LN:106368585
@SQ     SN:chr15        LN:100338915
@SQ     SN:chr16        LN:88827254
@SQ     SN:chr17        LN:78774742
@SQ     SN:chr18        LN:76117153
@SQ     SN:chr19        LN:63811651
@SQ     SN:chr20        LN:62435964
@SQ     SN:chrY LN:57772954
@SQ     SN:chr22        LN:49691432
@SQ     SN:chr21        LN:46944323
@SQ     SN:chrM LN:16571
@RG     ID:NA18507      PL:ILLUMINA     PU:CASAVA_hg18  LB:lib-NA18507  SM:GAIIx_100_chr21
@PG     ID:CASAVA       VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@CO     PICARD (v1.101) FixMateInformation.jar and AddOrReplaceReadGroups 
@CO     bam_re-pair.pl (v1.0) to remove unpaired reads and keep only nice pairs

 

produce a training subset with 10% original reads

All results produced during the exercise with full data are archived in a separate folder and a subset of reads is created to allow you try the commands presented here and get faster results.

create a working subset of the BAM file for use in the hands-on session

We generate here a 10% subset of the original BAM that you can use to quickly reproduce steps of this first exercise.

infolder="ori-Illumina_hg18_mapping_results"
infile="PE_NA18507_GAIIx_100_chr21.bam"
 
outfolder="Illumina_hg18_mapping"
outfile="NA18507_10pc_GAIIx_100_chr21.bam"
mkdir -p ${outfolder}
 
# P=0.1 should succeed with a 10% chance (produce a 10% subset)
java -jar $PICARD/picard.jar DownsampleSam \
	I=${infolder}/${infile} \
	O=${outfolder}/unsrted_${infile} \
	R=1 \
	P=0.1 \
	VALIDATION_STRINGENCY=LENIENT \
	QUIET=true \
	2>${outfolder}/10pc_DownsampleSam.err
 
# reads are (re)-sorted by name to keep pairs together
java -jar $PICARD/picard.jar SortSam \
	I=${outfolder}/unsrted_${infile} \
	O=${outfolder}/${outfile} \
	SO=queryname \
	CREATE_INDEX=false \
	QUIET=true \
	VALIDATION_STRINGENCY=LENIENT \
	2>${outfolder}/10pc_SortSam.err
Error creating thumbnail: Unable to save thumbnail to destination
The new Illumina_hg18_mapping folder now contains a small 140Mb 'NA18507_GAIIx_100_chr21.bam' file with 10% reads from the original. You can run all commands from this page with this sample but remember that you WILL obtain different results. You can also overwrite your copy with the trainer version of that file and proceed with it

download exercise files

Download exercise files here.

Use the right application to open the files present in ex1-files

References:
  1. http://www.illumina.com/truseq/tru_resources/datasets.ilmn
  2. ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam
  3. http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
  4. https://www.biostars.org/p/55830/

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 ]