SAM/BAM

SAM and BAM files hold the same content: the reads and their mapping information after alignment to a reference genome.
BAM files are binary, so they are compact but not human-readable.
SAM files are plain text, so humans can read them, but they are much larger than the corresponding BAM files.
Because of their compactness, BAM files have become the standard format for storing the results of mapping reads to a reference sequence.

SAM/BAM format

SAM/BAM reports a lot of information about the mapping. For each read that can be mapped to the reference sequence it reports:

position of the best hit (the region of the reference sequence where the read aligned best)
a Phred score representing the quality of the mapping
the sequence of the read
the quality string of the read (representing the base calling quality scores)
a so-called CIGAR string that represents the alignment between the read and the best hit on the reference sequence

See a full detailed description of the SAM/BAM format^[1].
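
Both the mapping quality and the base qualities mentioned above are Phred-scaled, i.e. Q = -10 * log10(P), where P is the probability that the mapping (or the base call) is wrong. The following is a minimal Python sketch of that conversion; the function names are made up for illustration and are not part of any SAM/BAM tool.

```python
import math

def phred_to_error_probability(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A higher Phred score means a lower chance of error.
for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

A score of 20 therefore corresponds to a 1 in 100 chance of error, and a score of 30 to 1 in 1000.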

the SAM/BAM header

The mapping information is preceded by a header, consisting of several lines. Each line in the header starts with a code that describes which type of information resides in that line:

header lines starting with @HD describe the SAM version (VN) and whether and how the file is sorted (SO)
header lines starting with @SQ describe the name and length of the reference sequences that were used to make the alignments
header lines starting with @PG describe software used to do the mapping

Here is the header from the BAM file of the NGS variant analysis training.

@HD VN:1.3 SO:coordinate
@PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149
@SQ SN:chr3 LN:199501827
@SQ SN:chr4 LN:191273063
@SQ SN:chr5 LN:180857866
@SQ SN:chr6 LN:170899992
@SQ SN:chr7 LN:158821424
@SQ SN:chrX LN:154913754
@SQ SN:chr8 LN:146274826
...
@SQ SN:chr20 LN:62435964
@SQ SN:chrY LN:57772954
@SQ SN:chr22 LN:49691432
@SQ SN:chr21 LN:46944323
@SQ SN:chrM LN:16571
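
To illustrate how the header codes and their TAG:VALUE fields fit together, here is a small Python sketch that reads a few of the header lines shown above. It assumes the fields are separated by tabs, as they are in an actual SAM file; the helper name parse_header_line is hypothetical.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {TAG: VALUE})."""
    parts = line.rstrip("\n").split("\t")
    record_type = parts[0].lstrip("@")
    fields = {}
    for part in parts[1:]:
        # Split on the first colon only, so values containing ':' stay intact.
        tag, _, value = part.partition(":")
        fields[tag] = value
    return record_type, fields

# A few lines from the header shown above, written with explicit tabs.
header_lines = [
    "@HD\tVN:1.3\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:247249719",
    "@SQ\tSN:chrM\tLN:16571",
]

parsed = [parse_header_line(line) for line in header_lines]

# Collect the reference names and lengths from the @SQ lines.
reference_lengths = {
    fields["SN"]: int(fields["LN"])
    for record_type, fields in parsed
    if record_type == "SQ"
}
print(reference_lengths)
```

Free-text comment lines (@CO) do not follow the TAG:VALUE layout and would need separate handling.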

the actual data in the SAM/BAM file

Here is an example of a line of data from a BAM file (wrapped here for display; in the file it is a single line with tab-separated columns).

61C2DAAXX:4:91:1662:10658#0 16 chr1 999935   25 75M * 0 0   TTAAGGCTCCCATTTACACTATCGAAAAAGATGGGACAAGTGCTGAAACGTGTATGAT
CCGGTTCCATGCTGGTC
4BBBBCBBBACBBBBCBCBCDC@BCCCC@CBCACBCCCCCCCCCCCBCCCCCCCCCCC
CCCCCCCCCCCCCCCC NM:i:0 NH:i:1

Each line in the file contains the information of the alignment between one of the reads and the reference sequence:

Column 1: 61C2DAAXX:4:91:1662:10658#0 QNAME name of the read
Column 2: 16 FLAG number that describes the alignment
Column 3: chr1 RNAME name of the reference sequence
Column 4: 999935 POS position on the reference sequence where the alignment starts
Column 5: 255 MAPQ Phred score representing the quality of the mapping
Column 6: 75M CIGAR indicating that the alignment consists of a 75 bp match/mismatch

Col  Field  Type    Regexp/Range               Brief description
7    RNEXT  String  \*|=|[!-()+-<>-~][!-~]*    Ref. name of the mate/next read
8    PNEXT  Int     [0, 2^31-1]                Position of the mate/next read
9    TLEN   Int     [-2^31+1, 2^31-1]          observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+             segment SEQuence
11   QUAL   String  [!-~]+                     ASCII of Phred-scaled base QUALity+33
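
The column layout described above can be sketched in Python as follows. This is an illustration only, assuming tab-separated columns; the function names are hypothetical, and the example line is rebuilt from the values printed above, with the two wrapped pieces of SEQ and QUAL joined.

```python
import re

FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_alignment_line(line):
    """Split one SAM alignment line into the 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELD_NAMES, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional fields such as NM:i:0
    return record

def parse_cigar(cigar):
    """Turn a CIGAR string such as '75M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def decode_qualities(qual):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]

# Values copied from the example line above (SEQ and QUAL rejoined).
example_line = "\t".join([
    "61C2DAAXX:4:91:1662:10658#0", "16", "chr1", "999935", "25", "75M",
    "*", "0", "0",
    "TTAAGGCTCCCATTTACACTATCGAAAAAGATGGGACAAGTGCTGAAACGTGTATGAT"
    "CCGGTTCCATGCTGGTC",
    "4BBBBCBBBACBBBBCBCBCDC@BCCCC@CBCACBCCCCCCCCCCCBCCCCCCCCCCC"
    "CCCCCCCCCCCCCCCC",
    "NM:i:0", "NH:i:1",
])

record = parse_alignment_line(example_line)
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]))
print(decode_qualities(record["QUAL"][:5]))
```

The quality character '4' has ASCII code 52, so it encodes a base quality of 52 - 33 = 19; 'B' (ASCII 66) encodes 33.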

'QNAME' is the name taken from the original FASTQ record, while the fields 'SEQ' and 'QUAL' reproduce the sequence and quality content of the original FASTQ record.

Look at the 5' end of the first BAM sequence and compare it to the two FASTQ records above.
Look at the 3' end of the second BAM sequence and compare it to the two FASTQ records above.

what did you find?


The first BAM record is identical to the second FASTQ record (AACCTTTGTTT...).
The second BAM record is identical to the first FASTQ record after reverse-complementing it (...GGTTGGATTC-3' => 5'-GAATCCAACC...). This is because of the forward-reverse library structure.
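
The reverse-complement step in the answer above can be checked with a few lines of Python. This is a minimal sketch; the function name is chosen here for illustration.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# 3' end of the FASTQ read, as quoted in the answer above.
print(reverse_complement("GGTTGGATTC"))  # GAATCCAACC
```

The 3' end of the FASTQ read thus becomes the 5' end of the sequence stored in the BAM record.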

For more information about read orientation, refer to the following IGV help page: http://www.broadinstitute.org/software/igv/interpreting_pair_orientations^[2]

the special FLAG field (column 2)

This field contains aggregated binary information that can be deciphered online at http://picard.sourceforge.net/explain-flags.html^[3]
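
The same decoding can be sketched in Python. The FLAG is a sum of bit values, each of which answers one yes/no question about the read; the bit values below follow the SAM specification^[1], while the short labels and the function name are paraphrased for illustration.

```python
# (bit value, meaning) pairs for the FLAG field.
FLAG_BITS = [
    (0x1, "read is paired"),
    (0x2, "read mapped in a proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on the reverse strand"),
    (0x20, "mate on the reverse strand"),
    (0x40, "first read in the pair"),
    (0x80, "second read in the pair"),
    (0x100, "secondary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def explain_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]

print(explain_flag(16))  # FLAG of the example record above
print(explain_flag(99))  # 99 = 64 + 32 + 2 + 1
```

The example record above has FLAG 16, which means that the read was aligned to the reverse strand.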

References:

[1] http://samtools.github.io/hts-specs/SAMv1.pdf

[2] http://www.broadinstitute.org/software/igv/interpreting_pair_orientations

[3] http://picard.sourceforge.net/explain-flags.html
