SAM and BAM files have the same content: read and mapping information after alignment of the reads to a reference genome.
However, BAM files are compressed binary files: they are compact but cannot be read directly by humans.
SAM files are plain text files, so humans can read them, but they are much larger than the corresponding BAM files.
Because of their compactness, BAM files have become the standard format for storing the results of mapping reads to a reference sequence.
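The binary nature of BAM can be checked without any special tools. The sketch below assumes that a BAM file is compressed with BGZF, a gzip-compatible block format, and that the decompressed stream starts with the four bytes `BAM\1`. The helper name `looks_like_bam` is hypothetical and only serves as an illustration.

```python
import gzip

# Assumption: the first four bytes of a decompressed BAM stream.
BAM_MAGIC = b"BAM\x01"


def looks_like_bam(path):
    """Return True if the file is gzip-compatible and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed, for example a plain-text SAM file.
        return False
```

A plain-text SAM file fails the gzip check and is reported as not being BAM. The check only looks at the first bytes, so it does not validate the rest of the file.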
SAM/BAM reports a lot of information about the mapping. For each read that can be mapped to the reference sequence it reports:
- position of the best hit (the region of the reference sequence where the reads aligned best)
- a Phred score representing the quality of mapping
- the sequence of the read
- the quality string of the read (representing the base calling quality scores)
- a so-called CIGAR string that represents the alignment between the read and the best hit on the reference sequence
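Both the mapping quality and the base calling qualities listed above are Phred scores, Q = -10 * log10(P), where P is the probability that the mapping or the base call is wrong. The following is a minimal sketch of the conversion. The function names are made up for illustration, and the offset of 33 assumes the encoding that SAM uses for its quality string.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P to the Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Turn a quality string into a list of Phred scores (ASCII code minus offset)."""
    return [ord(char) - offset for char in qual]


# A mapping quality of 25 corresponds to an error probability of about 0.0032.
print(phred_to_error_probability(25))

# The first six characters of a quality string, decoded one by one.
print(decode_quality_string("4BBBBC"))  # [19, 33, 33, 33, 33, 34]
```

A score of 20 therefore means a 1 in 100 chance of an error, and a score of 30 a 1 in 1000 chance.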
the SAM/BAM header
The mapping information is preceded by a header, consisting of several lines. Each line in the header starts with a code that describes which type of information resides in that line:
- header lines starting with @HD describe the version of the SAM format (VN) and whether and how the file is sorted (SO)
- header lines starting with @SQ describe the name and length of the reference sequences that were used to make the alignments
- header lines starting with @PG describe software used to do the mapping
Here is the header from the BAM file of the NGS variant analysis training.
@PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149
@SQ SN:chr3 LN:199501827
@SQ SN:chr4 LN:191273063
@SQ SN:chr5 LN:180857866
@SQ SN:chr6 LN:170899992
@SQ SN:chr7 LN:158821424
@SQ SN:chrX LN:154913754
@SQ SN:chr8 LN:146274826
@SQ SN:chr20 LN:62435964
@SQ SN:chrY LN:57772954
@SQ SN:chr22 LN:49691432
@SQ SN:chr21 LN:46944323
@SQ SN:chrM LN:16571
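Header lines like the ones above are easy to process in a script. The sketch below assumes that the fields of a header line are separated by tabs, as they are in a SAM file, and that each field has the form TAG:VALUE. The function name `parse_sam_header` is hypothetical, and the example uses only a few of the lines shown above.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type (@HD, @SQ, @PG, ...)."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        record_type, *fields = line.split("\t")
        if record_type == "@CO":
            # Comment lines hold free text instead of TAG:VALUE pairs.
            header.setdefault(record_type, []).append("\t".join(fields))
            continue
        # Split on the first colon only, since values may contain colons.
        tags = dict(field.split(":", 1) for field in fields)
        header.setdefault(record_type, []).append(tags)
    return header


example_lines = [
    "@PG\tID:CASAVA\tVN:CASAVA-1.7.0",
    "@SQ\tSN:chr1\tLN:247249719",
    "@SQ\tSN:chrM\tLN:16571",
]
header = parse_sam_header(example_lines)

# Build a lookup table of reference sequence lengths.
lengths = {sq["SN"]: int(sq["LN"]) for sq in header["@SQ"]}
print(lengths)  # {'chr1': 247249719, 'chrM': 16571}
```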
the actual data in the SAM/BAM file
Here is an example of a line of data from a BAM file.
61C2DAAXX:4:91:1662:10658#0 16 chr1 999935 25 75M * 0 0 TTAAGGCTCCCATTTACACTATCGAAAAAGATGGGACAAGTGCTGAAACGTGTATGATCCGGTTCCATGCTGGTC 4BBBBCBBBACBBBBCBCBCDC@BCCCC@CBCACBCCCCCCCCCCCBCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0 NH:i:1
Each line in the file contains the information of the alignment between one of the reads and the reference sequence:
- Column 1: 61C2DAAXX:4:91:1662:10658#0 QNAME name of the read
- Column 2: 16 FLAG number that describes the alignment
- Column 3: chr1 RNAME name of the reference sequence
- Column 4: 999935 POS position on the reference sequence where the alignment starts
- Column 5: 25 MAPQ Phred score representing the quality of the mapping
- Column 6: 75M CIGAR string; 75M means that all 75 bases of the read are aligned to the reference as matches or mismatches, without insertions or deletions
The remaining mandatory columns (field, type, regexp/range, brief description):
- Column 7: RNEXT, String, \*|=|[!-()+-<>-~][!-~]*, reference name of the mate/next read
- Column 8: PNEXT, Int, [0, 2^31-1], position of the mate/next read
- Column 9: TLEN, Int, [-2^31+1, 2^31-1], observed template length
- Column 10: SEQ, String, \*|[A-Za-z=.]+, segment sequence
- Column 11: QUAL, String, [!-~]+, ASCII of Phred-scaled base quality + 33
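The column layout described above can be turned into a small parser. This is a minimal sketch, assuming tab-separated columns and optional fields of the form TAG:TYPE:VALUE after column 11. The names `SamRecord`, `parse_cigar` and `parse_sam_line` are hypothetical. The record is a shortened, made-up variant of the example line, with 10 bases instead of 75.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def parse_sam_line(line):
    """Parse one alignment line into its 11 mandatory columns plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    tags = {}
    for item in cols[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], tags,
    )


# Shortened, hypothetical record based on the example line.
line = ("61C2DAAXX:4:91:1662:10658#0\t16\tchr1\t999935\t25\t10M\t*\t0\t0\t"
        "TTAAGGCTCC\t4BBBBCBBBA\tNM:i:0\tNH:i:1")
record = parse_sam_line(line)
print(record.rname, record.pos, parse_cigar(record.cigar))
```

Splitting on tabs rather than on any whitespace matters here, because the text after the tag and type of an optional field may itself contain spaces.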
- Look at the 5' end of the first BAM sequence and compare it to the two FASTQ records above.
- Look at the 3' end of the second BAM sequence and compare it to the two FASTQ records above.
What did you find?
the special FLAG field (column 2)