SAM/BAM


SAM and BAM files hold the same content: the reads and their mapping information after alignment to a reference genome.
BAM files are binary, so they are compact but not human-readable.
SAM files are plain text, so humans can read them, but they are much larger than the corresponding BAM files.
Because of their compactness, BAM files have become the standard format for storing the results of mapping reads to a reference sequence.

SAM/BAM format

SAM/BAM reports a lot of information about the mapping. For each read that can be mapped to the reference sequence it reports:

  • position of the best hit (the region of the reference sequence where the reads aligned best)
  • a Phred score representing the quality of mapping
  • the sequence of the read
  • the quality string of the read (representing the base calling quality scores)
  • a so-called CIGAR string that represents the alignment between the read and the best hit on the reference sequence

See the SAM/BAM format specification for a full description[1].
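To make the CIGAR string mentioned in the list above more concrete, here is a minimal Python sketch that splits a CIGAR string into its operations and computes how many reference and read bases the alignment covers. The function names (`parse_cigar`, `reference_span`, `query_length`) and the second CIGAR string used in the usage note are illustrative assumptions, not part of any existing tool; the sets of operations that consume reference or read bases follow the SAM specification[1].

```python
import re

# Operations that consume bases of the reference sequence.
REF_CONSUMING = set("MDN=X")
# Operations that consume bases of the read (query) sequence.
QUERY_CONSUMING = set("MIS=X")

# One CIGAR token: a length followed by a single operation letter.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '75M' into (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is not available
        return []
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings containing characters the pattern did not consume.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)
```

For a simple CIGAR such as `75M`, both `reference_span` and `query_length` return 75. For a hypothetical CIGAR with an insertion and a deletion, `10M2I5M3D20M`, the alignment covers 38 reference bases but 37 read bases.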

The SAM/BAM header

The mapping information is preceded by a header, consisting of several lines. Each line in the header starts with a code that describes which type of information resides in that line:

  • header lines starting with @HD describe the SAM version (VN) and whether and how the file is sorted (SO)
  • header lines starting with @SQ describe the name and length of the reference sequences that were used to make the alignments
  • header lines starting with @PG describe software used to do the mapping

Here is the header from the BAM file of the NGS variant analysis training.

@HD     VN:1.3  SO:coordinate
@PG     ID:CASAVA       VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
@SQ     SN:chr3 LN:199501827
@SQ     SN:chr4 LN:191273063
@SQ     SN:chr5 LN:180857866
@SQ     SN:chr6 LN:170899992
@SQ     SN:chr7 LN:158821424
@SQ     SN:chrX LN:154913754
@SQ     SN:chr8 LN:146274826
...
@SQ     SN:chr20        LN:62435964
@SQ     SN:chrY LN:57772954
@SQ     SN:chr22        LN:49691432
@SQ     SN:chr21        LN:46944323
@SQ     SN:chrM LN:16571
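The header structure described above (a record type code followed by `TAG:value` fields) can be read with a few lines of Python. This is only a sketch under stated assumptions: it expects the header as tab-separated text lines (as in a SAM file, or as printed from a BAM file by a tool such as samtools), and the function name `parse_header` is made up for illustration. The sample lines are taken from the header above, with the @PG line abbreviated to its ID and VN fields.

```python
def parse_header(lines):
    """Collect SAM header lines into a dict keyed by record type (HD, SQ, PG, ...)."""
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break  # header lines always precede the alignment records
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@CO":
            # Comment lines hold free text rather than TAG:value pairs.
            entry = "\t".join(fields)
        else:
            # Split on the first colon only: values (e.g. CL) may contain colons.
            entry = dict(field.split(":", 1) for field in fields)
        header.setdefault(record_type[1:], []).append(entry)
    return header


sample = [
    "@HD\tVN:1.3\tSO:coordinate",
    "@PG\tID:CASAVA\tVN:CASAVA-1.7.0",
    "@SQ\tSN:chr1\tLN:247249719",
    "@SQ\tSN:chr2\tLN:242951149",
]
header = parse_header(sample)
sort_order = header["HD"][0]["SO"]
reference_lengths = {sq["SN"]: int(sq["LN"]) for sq in header["SQ"]}
```

Here `sort_order` holds the sort order declared in the @HD line, and `reference_lengths` maps each reference name from the @SQ lines to its length.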

The actual data in the SAM/BAM file

Here is an example of a line of data from a BAM file.

61C2DAAXX:4:91:1662:10658#0 16 chr1 999935   25 75M * 0 0   TTAAGGCTCCCATTTACACTATCGAAAAAGATGGGACAAGTGCTGAAACGTGTATGATCCGGTTCCATGCTGGTC 4BBBBCBBBACBBBBCBCBCDC@BCCCC@CBCACBCCCCCCCCCCCBCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0 NH:i:1

Each line in the file contains the information of the alignment between one of the reads and the reference sequence:

  • Column 1: 61C2DAAXX:4:91:1662:10658#0 QNAME name of the read
  • Column 2: 16 FLAG number that describes the alignment
  • Column 3: chr1 RNAME name of the reference sequence
  • Column 4: 999935 POS position on the reference sequence where the alignment starts
  • Column 5: 25 MAPQ Phred score representing the quality of the mapping
  • Column 6: 75M CIGAR indicating that the alignment consists of a 75 bp match/mismatch

Columns 7 to 11 are described in the SAM specification as follows:

Col  Field  Type    Regexp/Range              Brief description
7    RNEXT  String  \*|=|[!-()+-<>-~][!-~]*   Ref. name of the mate/next read
8    PNEXT  Int     [0, 2^31-1]               Position of the mate/next read
9    TLEN   Int     [-2^31+1, 2^31-1]         observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+            segment SEQuence
11   QUAL   String  [!-~]+                    ASCII of Phred-scaled base QUALity+33
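As a rough illustration of how the eleven mandatory columns can be read programmatically, the following Python sketch splits one tab-separated SAM line into named fields and converts the QUAL string into Phred scores (ASCII code minus 33). The record used here is a hypothetical, shortened variant of the example above (read name `read1`, 10 bases instead of 75), and the names `SamRecord`, `parse_record` and `phred_scores` are illustrative, not part of any library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # column 1: name of the read
    flag: int    # column 2: bitwise FLAG
    rname: str   # column 3: name of the reference sequence
    pos: int     # column 4: 1-based leftmost position of the alignment
    mapq: int    # column 5: mapping quality
    cigar: str   # column 6: CIGAR string
    rnext: str   # column 7: reference name of the mate/next read
    pnext: int   # column 8: position of the mate/next read
    tlen: int    # column 9: observed template length
    seq: str     # column 10: read sequence
    qual: str    # column 11: base qualities, Phred+33 encoded
    tags: dict   # optional TAG:TYPE:VALUE fields (columns 12 and up)


def parse_record(line):
    """Parse one tab-separated SAM alignment line into a SamRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM record needs at least 11 columns")
    tags = {}
    for item in cols[11:]:
        tag, value_type, value = item.split(":", 2)
        # Only integer tags (type 'i') are converted in this sketch.
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]),
                     int(cols[4]), cols[5], cols[6], int(cols[7]),
                     int(cols[8]), cols[9], cols[10], tags)


def phred_scores(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in qual]


# Hypothetical, shortened record (assumption: not taken from a real file).
line = "read1\t16\tchr1\t999935\t25\t10M\t*\t0\t0\tTTAAGGCTCC\t4BBBBCBBBA\tNM:i:0\tNH:i:1"
record = parse_record(line)
scores = phred_scores(record.qual)
```

With this record, `record.flag` is 16, `record.cigar` is `10M`, and the first quality character `4` decodes to a Phred score of 19.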

'QNAME' is the name taken from the original FASTQ record, while the fields 'SEQ' and 'QUAL' reproduce the sequence and base quality content of the original FASTQ record.

  • look at the 5' end of the first BAM sequence and compare it to the two FASTQ records above!
  • look at the 3' end of the second BAM sequence and compare it to the two FASTQ records above!

What did you find?

Try it yourself before reading the answer below!

  • the first BAM record is identical to the second FASTQ record (AACCTTTGTTT...)
  • the second BAM record is identical to the first FASTQ record after reverse-complementing it (...GGTTGGATTC-3' => 5'-GAATCCAACC...). This is because of the forward-reverse library structure.
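The reverse-complement step in the answer above can be checked with a short Python sketch. The function name `reverse_complement` is chosen here for illustration, and the sketch assumes sequences containing only A, C, G, T and N.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the resulting string.
    return seq.translate(COMPLEMENT)[::-1]


# 3' end of the FASTQ read from the answer above.
fastq_end = "GGTTGGATTC"
bam_start = reverse_complement(fastq_end)
```

Here `bam_start` is `GAATCCAACC`, which matches the 5' end reported for the BAM record.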

For more information about read orientation, refer to the following IGV help page: http://www.broadinstitute.org/software/igv/interpreting_pair_orientations[2]


[Figure: read pair orientations]


The special FLAG field (column 2)

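The FLAG field is a single number that combines several yes/no properties of the alignment, one per bit. For example, the value 16 means that the read maps to the reverse strand. A FLAG value can be decoded online at http://picard.sourceforge.net/explain-flags.html[3].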
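The same decoding can be done locally. The following Python sketch tests each bit of a FLAG value against the bit definitions in the SAM specification[1]; the function name `decode_flag` and the short descriptions are illustrative wording, not the output of any particular tool.

```python
# FLAG bits as defined in the SAM specification, with short descriptions.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a FLAG value."""
    return [description for bit, description in FLAG_BITS if flag & bit]


# FLAG of the example record shown earlier on this page.
example = decode_flag(16)
```

For the example record (FLAG 16), `example` contains a single entry, "read reverse strand". A value such as 99 (1 + 2 + 32 + 64) decodes to a paired read, mapped in a proper pair, with the mate on the reverse strand, that is the first read of its pair.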



References:
  1. http://samtools.github.io/hts-specs/SAMv1.pdf
  2. http://www.broadinstitute.org/software/igv/interpreting_pair_orientations
  3. http://picard.sourceforge.net/explain-flags.html