NGS Exercise.1
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 ]
Download Illumina data - perform BAM QC - correct SAM/BAM formatting errors
This exercise was removed from the session program but is left to teach you how to recycle public NGS data from the internet.
Contents
- 1 Introduction
- 2 download the original bam file in terminal
- 3 inspect BAM content
- 4 perform BAM quality control
- 5 correct common BAM problems
- 6 produce a training subset with 10% original reads
- 7 download exercise files
Introduction
In this first session, we will inspect the BAM data and perform a number of controls to detect formatting errors and fix them when pssible.
The Illumina shared 'True Seq' resources page [1] provides various datasets and reference files used for this training. Obtaining data and installing tools is not covered in this session. If you wish to reproduce the environment available on your laptop today, please contact us.
The chr21 data was obtained from the ftp link [2]. This file is about 1GB in size and provides all reads mapped to chr21 (hg18) using the Illumina CASAVA suite in BAM format. As we will see in this exercise, the Illumina BAM file presents serious formatting issues that will be interesting to evaluate and fix during this session and get a general idea of this kind of maintenance tasks.
The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). A lot of information is available for that sample from the Coriell repository where the gDNA is archived [3].
Precomputed results of this exercise have been moved to Final_Results/ori-Illumina_hg18_mapping_results/
download the original bam file in terminal
The following command will download the data in the current folder.
infolder="ori-Illumina_hg18_mapping" mkdir -p ${infolder} # wget allows downloading files from the internet # when you have the full URL to a file, you can download it (even if it is on a FTP server, read the 'man wget' pages) wget -P ${infolder} \ ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam
what did we learn here?
- creating a folder when it does not exist yet
- create subfolder in that folder in one go (mkdir -p first/second/third-level where 'first', 'second', and 'third-level' are created, nested in each other)
- leave existing folder and/or subfolder unchanged if they already existed ('-p' this will not trigger errors when trying to create an already existing folder
inspect BAM content
While SAM data can be inspected using basic unix commands (head, less, grep, ...), BAM format instead is compressed and requires a specialized parser to read the file. Two broadly used tools to view BAM data are samtools and its java derivative picard.
samtools comprises only few distinct tools collapsed here
Version: 0.1.19+
Usage: samtools <command> [options]
Command: view SAM<->BAM conversion
sort sort alignment file
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
tview text alignment viewer
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
bedcov read depth per BED region
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes
bamshuf shuffle and group alignments by name
picard, by contrast, comprises many more specific tools (below an overview for the current version)
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>
## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar
view the top 5 lines of a bam file with samtools
infolder="ori-Illumina_hg18_mapping" samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam | \ head -5 samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam | head -5 EAS51_0210:3:6:3797:7459 89 chr21 9719702 73 100M * 0 0 ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34 SM:i:73 AS:i:0 EAS51_0210:3:6:3797:7459 165 chr21 9719702 255 * * 0 0 AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC H0:i:0 H1:i:0 H2:i:2 SM:i:-1 AS:i:0 EAS51_0210:7:33:5109:13959 145 chr21 9719707 254 100M chrY 10653706 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2 SM:i:461 AS:i:0 EAS25_0078:8:23:14907:11377 89 chr21 9719708 254 100M * 0 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3 SM:i:461 AS:i:0 EAS25_0078:8:23:14907:11377 165 chr21 9719708 255 * * 0 0 CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF H0:i:1 H1:i:6 H2:i:60 SM:i:-1 AS:i:0
what do you see in these lines?
- the five lines refer to three read pairs but one pair has only one read => The second read is missing and has probably been removed by Illumina to generate this chr21 subset
- the read names are identical within a pair => normally, paired reads end with /1 and /2 respectively
To also get the header, add -h after view. To see ONLY the header replace -h by -H
infolder="ori-Illumina_hg18_mapping" # print only the header samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam -H # print the header followed by ALL read lines (ie: the full BAM content !!!) samtools view ${infolder}/NA18507_GAIIx_100_chr21.bam -h | more
what did we learn here?
view the same top 5 lines of a bam file with picard ViewSam
We added the @PG line of header seen with -H to the count to come to 6. VALIDATION_STRINGENCY=SILENT was also added to avoid getting error messages about inconsistent format.
infolder="ori-Illumina_hg18_mapping" infile="NA18507_GAIIx_100_chr21.bam" java -jar $PICARD/picard.jar ViewSam \ I=${infolder}/${infile} \ VALIDATION_STRINGENCY=SILENT \ QUIET=TRUE \ | head -6 @PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16 EAS51_0210:3:6:3797:7459 89 chr21 9719702 73 100M * 0 0 ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34 SM:i:73 AS:i:0 EAS51_0210:3:6:3797:7459 165 chr21 9719702 255 * * 0 0 AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC H0:i:0 H1:i:0 H2:i:2 SM:i:-1 AS:i:0 EAS51_0210:7:33:5109:13959 145 chr21 9719707 254 100M chrY 10653706 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2 SM:i:461 AS:i:0 EAS25_0078:8:23:14907:11377 89 chr21 9719708 254 100M * 0 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3 SM:i:461 AS:i:0 EAS25_0078:8:23:14907:11377 165 chr21 9719708 255 * * 0 0 CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF H0:i:1 H1:i:6 H2:i:60 SM:i:-1 AS:i:0 EAS25_0078:3:4:1830:9254 153 chr21 9719709 254 100M * 0 0 GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4 SM:i:461 AS:i:0
perform BAM quality control
get basic counts from a bam file with samtools
infolder="ori-Illumina_hg18_mapping" samtools flagstat ${infolder}/NA18507_GAIIx_100_chr21.bam 15069635 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 14165019 + 0 mapped (94.00%:nan%) 15069635 + 0 paired in sequencing 7534598 + 0 read1 7535037 + 0 read2 12911388 + 0 properly paired (85.68%:nan%) 13260403 + 0 with itself and mate mapped 904616 + 0 singletons (6.00%:nan%) 79441 + 0 with mate mapped to a different chr 79024 + 0 with mate mapped to a different chr (mapQ>=5)
perform similar analysis with picard
Picard has several analysis tools to parse BAM data and collect metrics; several will be exemplified later today.
CollectAlignmentSummaryMetrics CollectGcBiasMetrics CollectInsertSizeMetrics CollectMultipleMetrics CollectRnaSeqMetrics CollectTargetedPcrMetrics
what commands does Picard offer?
java -jar $PICARD/picard.jar
get a detailed BAM alignment summary with picard
This time, we add VALIDATION_STRINGENCY=LENIENT to get feedback about of format errors but without stopping on the first error.
infolder="ori-Illumina_hg18_mapping" infile="NA18507_GAIIx_100_chr21.bam" # create a folder for Your results outfolder="ori-Illumina_hg18_mapping_results" mkdir -p ${outfolder} java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ I=${infolder}/${infile} \ O=${outfolder}/ori_CollectAlignmentSummaryMetrics.txt \ VALIDATION_STRINGENCY=LENIENT # to avoid such long scrolling text, we will later add an extra line # that saves all error reports to a text file # eg: 2>${outfolder}/ori_CollectAlignmentSummaryMetrics.err # '2>' is there to redirect error messages to ... the file defined by the remaining of the line # ... a lot of error messages! then finally ... cat ${outfolder}/ori_CollectAlignmentSummaryMetrics.txt
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP FIRST_OF_PAIR 7534598 7534598 1 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 0.000011 SECOND_OF_PAIR 7535037 7535037 1 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 0.00001 PAIR 15069635 15069635 1 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 0.000011
The too many columns for few lines geometry makes that the result reads better after omitting comment lines, rotating and omitting the empty first column (admire the power of command line!?).
outfolder="ori-Illumina_hg18_mapping_results" grep -v "^#" ${outfolder}/ori_CollectAlignmentSummaryMetrics.txt | \ transpose -t | \ cut -f 2- | \ column -t CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 7534598 7535037 15069635 PF_READS 7534598 7535037 15069635 PCT_PF_READS 1 1 1 PF_NOISE_READS 0 0 0 PF_READS_ALIGNED 0 0 0 PCT_PF_READS_ALIGNED 0 0 0 PF_ALIGNED_BASES 0 0 0 PF_HQ_ALIGNED_READS 0 0 0 PF_HQ_ALIGNED_BASES 0 0 0 PF_HQ_ALIGNED_Q20_BASES 0 0 0 PF_HQ_MEDIAN_MISMATCHES 0 0 0 PF_MISMATCH_RATE 0 0 0 PF_HQ_ERROR_RATE 0 0 0 PF_INDEL_RATE 0 0 0 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 0 0 0 PCT_READS_ALIGNED_IN_PAIRS 0 0 0 BAD_CYCLES 0 0 0 STRAND_BALANCE 0 0 0 PCT_CHIMERAS 0 0 0 PCT_ADAPTER 0.000011 0.00001 0.000011 SAMPLE LIBRARY READ_GROUP
what did we learn here?
- grep -v "^#" excludes lines starting with '#'
- 'cut -f 2-' keeps columns 2 and upwards and removes the first (empty) column
- 'transpose -t' converts (many columns x few rows) TO (few columns x many rows) that read better
- 'column -t ' normalizes the table with spaces to align the text in each column
As can be seen, not all numbers match between the two methods. For instance, it is clear from the counts that the number of reads in both sets is unequal, which suggests that some reads have been removed from either or both groups. This will later be confirmed when fixing the BAM data for reanalysis.
identify common BAM problems
The Illumina BAM file has several frustrating issues that we will now fix.
Picard provides a command to validate SAM/BAM data and report all errors with their count through the file. The command can be run in quick mode to return only the first N errors or fully to report a count table as shown below.
validate a SAM/BAM file and report the first 100 ERRORS
infolder="ori-Illumina_hg18_mapping" infile="NA18507_GAIIx_100_chr21.bam" outfolder="ori-Illumina_hg18_mapping_results" # 'IGNORE_WARNINGS=TRUE' omits top comments in the result ############## # first 100 errors and no warnings java -jar $PICARD/picard.jar ValidateSamFile \ I=${infolder}/${infile} \ MO=100 \ VALIDATION_STRINGENCY=LENIENT \ IGNORE_WARNINGS=TRUE \ O=${outfolder}/ori_ValidateSamFile_MO100.txt # the result can be viewed with head ${outfolder}/ori_ValidateSamFile_MO100.txt
ERROR: Read groups is empty ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mapped mate should have mate reference name ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, MAPQ should be 0 for unmapped read. ERROR: Record 1, Read name EAS51_0210:3:6:3797:7459, Mate alignment does not match alignment start of mate ERROR: Record 1, Read name EAS51_0210:3:6:3797:7459, Mate reference index (MRNM) does not match reference index of mate ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mate alignment does not match alignment start of mate ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mate reference index (MRNM) does not match reference index of mate ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, Mapped mate should have mate reference name ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, MAPQ should be 0 for unmapped read. ERROR: Record 4, Read name EAS25_0078:8:23:14907:11377, Mate alignment does not match alignment start of mate
From the first run, we often see:
- MAPQ should be 0 for unmapped read.
- Mapped mate should have mate reference name
We can build a more clever command to get a count of different errors found in the first part of the file. This is another demonstration of the power of the command line and piping commands
infolder="ori-Illumina_hg18_mapping_results" infile=ori_ValidateSamFile_MO100.txt # ignore the line indicating the limit at 100 errors # filter with gawk and keep the last column of each 'WARNING' or 'ERROR' line ('$NF') # sort the output to put similar lines together ('sort') and count each group ('uniq -c') # count errors and summarize grep -v "Maximum output of" ${infolder}/${infile} | \ gawk 'BEGIN{FS=", ";OFS="\t"}{print $NF}' | \ sort | \ uniq -c 1 ERROR: Read groups is empty 17 Mapped mate should have mate reference name 17 MAPQ should be 0 for unmapped read. 33 Mate alignment does not match alignment start of mate 32 Mate reference index (MRNM) does not match reference index of mate
This confirms the previous inspection with addition of one ERROR: Read groups is empty
validate a SAM/BAM file and report all ERRORS with counts
Replacing MO=100 by M=SUMMARY allows reporting the counts of all errors in the file.
infolder="ori-Illumina_hg18_mapping" infile="NA18507_GAIIx_100_chr21.bam" outfolder="ori-Illumina_hg18_mapping_results" java -jar $PICARD/picard.jar ValidateSamFile \ I=${infolder}/${infile} \ M=SUMMARY \ VALIDATION_STRINGENCY=LENIENT \ IGNORE_WARNINGS=TRUE \ O=${outfolder}/Illumina_ValidateSamFile_summary.txt
# inspect results cat ${outfolder}/Illumina_ValidateSamFile_summary.txt ## HISTOGRAM java.lang.String Error Type Count ERROR:CIGAR_MAPS_OFF_REFERENCE 154 ERROR:INVALID_ALIGNMENT_START 2 ERROR:INVALID_FLAG_MATE_UNMAPPED 904616 ERROR:INVALID_MAPPING_QUALITY 904616 ERROR:MATE_NOT_FOUND 79441 ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 453969 ERROR:MISMATCH_MATE_ALIGNMENT_START 1809232 ERROR:MISMATCH_MATE_REF_INDEX 1809232 ERROR:MISSING_READ_GROUP 1
correct common BAM problems
fix incorrect mate information in the PE BAM file
Some providers or software produce incorrect mate pair information in their output. This picard tool parses the full BAM content and corrects this error. Do not forget to keep the 'VALIDATION_STRINGENCY=LENIENT' parameter to allow picard processing of your file in the presence of other issues.
# all picard commands are 'LENIENT' to continue error infolder="ori-Illumina_hg18_mapping" infile="NA18507_GAIIx_100_chr21.bam" outfolder="ori-Illumina_hg18_mapping_results" # FixMateInformation expects input in name order # it will first resort the reads using picard internal sorting command then proceed # we ask the output to be kept name-sorted java -jar $PICARD/picard.jar FixMateInformation \ I=${infolder}/${infile} \ O=${outfolder}/nmsrt_fixmate_${infile} \ SO=queryname \ VALIDATION_STRINGENCY=LENIENT \ 2>${outfolder}/FixMateInformation.err
If we inspect the error log and kept the first 10 lines of this otherwise very long file we notice the resorting step on line #3
head -10 ${outfolder}/FixMateInformation.err [Thu Feb 18 09:22:09 CET 2016] picard.sam.FixMateInformation INPUT=[ori-Illumina_hg18_mapping/NA18507_GAIIx_100_chr21.bam] OUTPUT=ori-Illumina_hg18_mapping_results/nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam SORT_ORDER=queryname VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=false ADD_MATE_CIGAR=true IGNORE_MISSING_MATES=true VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Thu Feb 18 09:22:09 CET 2016] Executing as splaisan@r710bits on Linux 3.10.0-327.4.5.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater INFO 2016-02-18 09:22:09 FixMateInformation Sorting input into queryname order. Ignoring SAM validation error: ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, Mapped mate should have mate reference name Ignoring SAM validation error: ERROR: Record 2, Read name EAS51_0210:3:6:3797:7459, MAPQ should be 0 for unmapped read. Ignoring SAM validation error: ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, Mapped mate should have mate reference name Ignoring SAM validation error: ERROR: Record 5, Read name EAS25_0078:8:23:14907:11377, MAPQ should be 0 for unmapped read. Ignoring SAM validation error: ERROR: Record 7, Read name EAS25_0078:3:4:1830:9254, Mapped mate should have mate reference name Ignoring SAM validation error: ERROR: Record 7, Read name EAS25_0078:3:4:1830:9254, MAPQ should be 0 for unmapped read. Ignoring SAM validation error: ERROR: Record 9, Read name EAS25_0078:7:72:6046:4654, Mapped mate should have mate reference name
Replacement command using samtools fixmate
- when using the more recent samtools htslib, add '-o' before the output name
- the syntax shown below is for the older samtools v0.1.19
infolder="ori-Illumina_hg18_mapping" infile="NA18507_GAIIx_100_chr21.bam" outfolder="ori-Illumina_hg18_mapping_results" # first sorting by read names samtools sort -n ${infolder}/${infile} \ ${outfolder}/st_nmsrt_${infile%%.bam} \ 2>${outfolder}/samtools_Sort.err ## when using the more recent samtools htslib, add '-o' before the base name of the output file # samtools sort -n ${infolder}/${infile} \ # -o ${outfolder}/st_nmsrt_${infile%%.bam} \ # 2>${outfolder}/samtools_Sort.err # now fixing mate information samtools fixmate ${outfolder}/st_nmsrt_${infile} \ ${outfolder}/st_fxmt_nmsrt_${infile} \ 2>${outfolder}/samtools_FixMateInformation.err
add or replace absent or incorrect 'read groups' in a BAM file
Read groups abbreviated "@RG" is a very important annotation in a SAM/BAM header as it defines the provenance of the reads and will be required when comparing or mixing reads between samples. Some software or providers do not add this line to the SAM header while other software (GATK) do require it. This picard script adds a defined read group to a BAM file to correct this issue. A similar command can be built with samtools and a short manually-edited and corrected header stored in a text file as template.
infolder=ori-Illumina_hg18_mapping_results infile=nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam outfolder="ori-Illumina_hg18_mapping_results" outfile=rg_${infile} ## AddOrReplaceReadGroups ID="NA18507" LB="lib-NA18507" PU="CASAVA_hg18" PL="ILLUMINA" SM="GAIIx_100_chr21" java -jar $PICARD/picard.jar AddOrReplaceReadGroups \ I=${infolder}/${infile} \ O=${outfolder}/${outfile} \ RGID=${ID} \ RGLB=${LB} \ RGPL=${PL} \ RGPU=${PU} \ RGSM=${SM} \ VALIDATION_STRINGENCY=LENIENT \ 2>${outfolder}/AddOrReplaceReadGroups.err # Look at the newly added line samtools view -H ${outfolder}/${outfile}
The remaining errors logged in AddOrReplaceReadGroups.err are mostly MAPQ should be 0 for unmapped read.. This 'pseudo'-error is largely commented on fora like Biostar[4] and can safely be ignored.
The header now includes a @RG line required by many tools dedicated to SAM / BAM data.
@HD VN:1.4 SO:queryname @SQ SN:chr1 LN:247249719 @SQ SN:chr2 LN:242951149 @SQ SN:chr3 LN:199501827 @SQ SN:chr4 LN:191273063 @SQ SN:chr5 LN:180857866 @SQ SN:chr6 LN:170899992 @SQ SN:chr7 LN:158821424 @SQ SN:chrX LN:154913754 @SQ SN:chr8 LN:146274826 @SQ SN:chr9 LN:140273252 @SQ SN:chr10 LN:135374737 @SQ SN:chr11 LN:134452384 @SQ SN:chr12 LN:132349534 @SQ SN:chr13 LN:114142980 @SQ SN:chr14 LN:106368585 @SQ SN:chr15 LN:100338915 @SQ SN:chr16 LN:88827254 @SQ SN:chr17 LN:78774742 @SQ SN:chr18 LN:76117153 @SQ SN:chr19 LN:63811651 @SQ SN:chr20 LN:62435964 @SQ SN:chrY LN:57772954 @SQ SN:chr22 LN:49691432 @SQ SN:chr21 LN:46944323 @SQ SN:chrM LN:16571 @RG ID:NA18507 PL:ILLUMINA PU:CASAVA_hg18 LB:lib-NA18507 SM:GAIIx_100_chr21 @PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
remove un-paired reads from the Illumina chr21 subset data
We will finally fix the pairing problem in this bam file and remove all singleton reads with a custom script as there is no standard command to do this in either samtools or picard.
We also add a read pair index after read names in that script to better comply to the format standards.
infolder="ori-Illumina_hg18_mapping_results" infile="rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam" outfile="PE_${infile}" # parse all reads and output correctly paired lines to a new bam file samtools view -h ${infolder}/${infile} \ | bam_re-pair.pl \ | samtools view -bSo ${infolder}/${outfile} - ############################ # Results # processed: 15069635 # passed-pairs: 7495097 # failed-reads: 79441
check the effect of removing singleton reads
The picard ValidateSamFile command with M=SUMMARY and the original data, is now reapplied to the cleaned data.
infolder="ori-Illumina_hg18_mapping_results" infile="PE_rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam" # 'IGNORE_WARNINGS=TRUE' to focus on real 'ERRORS' and ignore less hazardous 'WARNINGS' ############## # M=SUMMARY java -jar $PICARD/picard.jar ValidateSamFile \ I=${infolder}/${infile} \ M=SUMMARY \ VALIDATION_STRINGENCY=LENIENT \ IGNORE_WARNINGS=TRUE \ O=${infolder}/PE_ValidateSamFile_Summary.txt # the result can be viewed with cat ${infolder}/PE_ValidateSamFile_Summary.txt ## HISTOGRAM java.lang.String Error Type Count ERROR:CIGAR_MAPS_OFF_REFERENCE 278 ERROR:INVALID_MAPPING_QUALITY 904616 ERROR:MATE_NOT_FOUND 14990194
It is nice to see that we have indeed solved most of the original issues. What remains can be left as is.
get samtools statistics on correctly paired reads
infolder="ori-Illumina_hg18_mapping_results" infile="PE_rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam" # extract the name from the file-name (remove .bam) name=${infile%%.bam} samtools flagstat ${infolder}/${infile} > ${infolder}/${name}_flagstat.txt # inspect with cat ${infolder}/${name}_flagstat.txt 14990194 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 14085578 + 0 mapped (93.97%:-nan%) 14990194 + 0 paired in sequencing 7495097 + 0 read1 7495097 + 0 read2 12911388 + 0 properly paired (86.13%:-nan%) 13180962 + 0 with itself and mate mapped 904616 + 0 singletons (6.03%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
add @CO lines to the bam to log all data manipulations
Keeping track of what was done to your data is never a lost time. Picard tools can add the '@CO' line (also a @PG line) at run time when adding the parameter 'CO=' followed by a text describing the action (not done today but a good practice for you). Please read the help page of the Picard tools for more info about such options (java -jar $PICARD/picard.jar picard_tool_name -h)
some additional optional parameters shared by some but not all picard tools
PROGRAM_RECORD_ID=String PG=String The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value. PROGRAM_GROUP_VERSION=String PG_VERSION=String Value of VN tag of PG record to be created. If not specified, the version will be detected automatically. Default value: null. PROGRAM_GROUP_COMMAND_LINE=String PG_COMMAND=String Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically. Default value: null. PROGRAM_GROUP_NAME=String PG_NAME=String Value of PN tag of PG record to be created. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value. COMMENT=String CO=String Comment(s) to include in the output file's header. This option may be specified 0 or more times.
There is no command to add a @CO header line out of the box but we can create a samtools piped line as follows.
infolder="ori-Illumina_hg18_mapping_results" infile="PE_rg_nmsrt_fixmate_NA18507_GAIIx_100_chr21.bam" outfile="PE_NA18507_GAIIx_100_chr21.bam" # add comments for the record costring='@CO\tPICARD (v1.101) FixMateInformation.jar and AddOrReplaceReadGroups \n@CO\tbam_re-pair.pl (v1.0) to remove unpaired reads and keep only nice pairs' # build a new header block on the fly and add all reads back to it ( samtools view -H ${infolder}/${infile}; \ echo -e ${costring}; \ samtools view ${infolder}/${infile} ) | \ samtools view -Sb -o ${infolder}/${outfile} -
The final SAM header is now nicely GATK/PICARD compliant (note on the first @HD line the sort order now set to 'SO:queryname') .
samtools view -H ${infolder}/${outfile} @HD VN:1.4 SO:queryname @SQ SN:chr1 LN:247249719 @SQ SN:chr2 LN:242951149 @SQ SN:chr3 LN:199501827 @SQ SN:chr4 LN:191273063 @SQ SN:chr5 LN:180857866 @SQ SN:chr6 LN:170899992 @SQ SN:chr7 LN:158821424 @SQ SN:chrX LN:154913754 @SQ SN:chr8 LN:146274826 @SQ SN:chr9 LN:140273252 @SQ SN:chr10 LN:135374737 @SQ SN:chr11 LN:134452384 @SQ SN:chr12 LN:132349534 @SQ SN:chr13 LN:114142980 @SQ SN:chr14 LN:106368585 @SQ SN:chr15 LN:100338915 @SQ SN:chr16 LN:88827254 @SQ SN:chr17 LN:78774742 @SQ SN:chr18 LN:76117153 @SQ SN:chr19 LN:63811651 @SQ SN:chr20 LN:62435964 @SQ SN:chrY LN:57772954 @SQ SN:chr22 LN:49691432 @SQ SN:chr21 LN:46944323 @SQ SN:chrM LN:16571 @RG ID:NA18507 PL:ILLUMINA PU:CASAVA_hg18 LB:lib-NA18507 SM:GAIIx_100_chr21 @PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16 @CO PICARD (v1.101) FixMateInformation.jar and AddOrReplaceReadGroups @CO bam_re-pair.pl (v1.0) to remove unpaired reads and keep only nice pairs
produce a training subset with 10% original reads
All results produced during the exercise with full data are archived in a separate folder and a subset of reads is created to allow you try the commands presented here and get faster results.
create a working subset of the BAM file for use in the hands-on session
We generate here a 10% subset of the original BAM that you can use to quickly reproduce steps of this first exercise.
infolder="ori-Illumina_hg18_mapping_results" infile="PE_NA18507_GAIIx_100_chr21.bam" outfolder="Illumina_hg18_mapping" outfile="NA18507_10pc_GAIIx_100_chr21.bam" mkdir -p ${outfolder} # P=0.1 should succeed with a 10% chance (produce a 10% subset) java -jar $PICARD/picard.jar DownsampleSam \ I=${infolder}/${infile} \ O=${outfolder}/unsrted_${infile} \ R=1 \ P=0.1 \ VALIDATION_STRINGENCY=LENIENT \ QUIET=true \ 2>${outfolder}/10pc_DownsampleSam.err # reads are (re)-sorted by name to keep pairs together java -jar $PICARD/picard.jar SortSam \ I=${outfolder}/unsrted_${infile} \ O=${outfolder}/${outfile} \ SO=queryname \ CREATE_INDEX=false \ QUIET=true \ VALIDATION_STRINGENCY=LENIENT \ 2>${outfolder}/10pc_SortSam.err
download exercise files
Download exercise files here.
References:
- ↑ http://www.illumina.com/truseq/tru_resources/datasets.ilmn
- ↑ ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam
- ↑ http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
- ↑ https://www.biostars.org/p/55830/
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 ]