.sam

From BITS wiki

Source 1
Source 2 (pdf)
What is SAM

SAM (Sequence Alignment/Map) is a generic alignment format for representing NGS reads mapped to a reference genome. It supports short and long reads from different sequencing platforms, is flexible in style, compact in size, and efficient in random access.

The SAM format consists of a header section (~metadata) and an alignment section (raw mapping information). The whole header section can be absent, but keeping the header is recommended.

BAM is the binary representation of SAM. It is more compact and can be accessed more quickly.

Example

  @HD VN:1.0
  @SQ SN:chr20 LN:62435964
  @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
  @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891 
  read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195  AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
  read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2

@ - header section

Although the header is optional, it must follow some format rules when present. Header lines must start with @, followed by a two-letter record code (HD, SQ, RG, ...). After this code, tag:value pairs are reported (e.g. SN:chr20). If SQ records are present in the header, RNAME (reference name - see alignment section) and MRNM (mate reference sequence name - see alignment section) must appear in an SQ header record.
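The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: it assumes tab-separated header lines as in a real SAM file, the function name parse_header_line is made up for this example, and free-text comment records (@CO) are not handled.

```python
def parse_header_line(line):
    """Split one SAM header line into its record code and tag:value pairs."""
    fields = line.rstrip("\n").split("\t")
    if not fields[0].startswith("@"):
        raise ValueError("header lines must start with '@'")
    code = fields[0][1:]  # e.g. 'HD', 'SQ', 'RG'
    # Split on the first ':' only, because a value may itself contain ':'
    pairs = dict(field.split(":", 1) for field in fields[1:])
    return code, pairs


# Header lines from the example above, written with explicit tabs
header = [
    "@HD\tVN:1.0",
    "@SQ\tSN:chr20\tLN:62435964",
    "@RG\tID:L1\tPU:SC_1_10\tLB:SC_1\tSM:NA12891",
]
records = [parse_header_line(line) for line in header]

# Reference names declared in SQ records; RNAME and MRNM values
# of the alignment lines are expected to appear in this set
reference_names = {pairs["SN"] for code, pairs in records if code == "SQ"}
print(reference_names)
```

Collecting the SN values into a set makes it cheap to check each alignment line's RNAME against the header.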

alignment section

Mandatory fields

  1. QNAME: Query name of the read or the read pair
  2. FLAG: Bitwise flag (pairing, strand, mate strand, etc.)
  3. RNAME: Reference sequence name
  4. POS: 1-based leftmost position of clipped alignment
  5. MAPQ: Mapping quality (Phred-scaled) (zero means unmapped)
  6. CIGAR: Extended CIGAR string (operations: MIDNSHP)
  7. MRNM: Mate reference name (‘=’ if same as RNAME)
  8. MPOS: 1-based leftmost mate position
  9. ISIZE: Inferred insert size
  10. SEQ: Query sequence on the same strand as the reference
  11. QUAL: Query quality (ASCII-33=Phred base quality)
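As an illustration of the fields listed above, the following Python sketch splits one alignment line into named fields and breaks the CIGAR string into its operations. It assumes tab-separated columns as in a real SAM file; the names Alignment, parse_alignment_line and split_cigar are invented for this example, and only the CIGAR operations named above (MIDNSHP) are recognised.

```python
import re
from collections import namedtuple

FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}
Alignment = namedtuple("Alignment", FIELD_NAMES + ["OPTIONAL"])


def parse_alignment_line(line):
    """Return the mandatory fields by name, plus the raw optional fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELD_NAMES):
        raise ValueError("expected at least 11 tab-separated fields")
    values = [int(value) if name in INT_FIELDS else value
              for name, value in zip(FIELD_NAMES, columns)]
    return Alignment(*values, OPTIONAL=columns[len(FIELD_NAMES):])


def split_cigar(cigar):
    """Turn e.g. '10M1D25M' into [(10, 'M'), (1, 'D'), (25, 'M')]."""
    return [(int(length), op)
            for length, op in re.findall(r"(\d+)([MIDNSHP])", cigar)]


# First read of the example above, written with explicit tabs
line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195",
    "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<",
    "NM:i:1", "RG:Z:L1",
])
aln = parse_alignment_line(line)
print(aln.QNAME, aln.FLAG, aln.POS, split_cigar(aln.CIGAR))
```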

FLAG field

The FLAG field describes the nature of the alignment of the read to the reference sequence. This number represents a certain combination of features of the alignment. The flag field consists of 11 bits (bits 0 to 10 below), each of which records the absence (0) or presence (1) of a feature.

Bit 0 (1) = The read is one of two or more reads coming from the same fragment during sequencing (e.g. a pair)
Bit 1 (2) = All mate reads are mapped
Bit 2 (4) = The query sequence is unmapped
Bit 3 (8) = The next mate read is unmapped
Bit 4 (16) = Strand of query (0=forward 1=reverse)
Bit 5 (32) = Strand of the next mate read
Bit 6 (64) = The first read of the mate reads
Bit 7 (128) = The last read of the mate reads
Bit 8 (256) = Secondary alignment
Bit 9 (512) = Read fails quality checks
Bit 10 (1024) = Read is PCR or optical duplicate

The features are stored bitwise, but the FLAG field reports them as a single decimal number: the number these bits represent. Unfortunately, this number is still hard to read. Luckily, there are tools to convert this number into a feature list, like this one: https://broadinstitute.github.io/picard/explain-flags.html[1]. We see for example that a FLAG value of 16 means that the read is mapped on the reverse strand.
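The same conversion can be sketched in Python. The snippet below is a minimal example, not the code of the tool linked above: the table FLAG_BITS and the function explain_flag are names chosen for this illustration, and the descriptions are shortened from the bit list above.

```python
# (mask, description) pairs following the bit list above
FLAG_BITS = [
    (1, "read is paired"),
    (2, "all mate reads are mapped"),
    (4, "query is unmapped"),
    (8, "next mate is unmapped"),
    (16, "query on reverse strand"),
    (32, "next mate on reverse strand"),
    (64, "first read of the mate reads"),
    (128, "last read of the mate reads"),
    (256, "secondary alignment"),
    (512, "read fails quality checks"),
    (1024, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """List the features whose bit is set in a FLAG value."""
    return [description for mask, description in FLAG_BITS if flag & mask]


# FLAG values of the two reads in the example above, plus the value 16
for flag in (99, 147, 16):
    print(flag, explain_flag(flag))
```

For the first example read, 99 = 64 + 32 + 2 + 1, so the read is paired, all mates are mapped, the mate lies on the reverse strand, and this is the first read of the pair.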

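To convert a number to the alignment information, you can also check out this perl script, and the ASCII table. To view the complete SAM file with human-readable FLAGs, use "samtools view -X" (this option prints flags as strings in older samtools 0.1.x releases; newer releases provide the "samtools flags" subcommand to convert a number into flag names).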

Look at the explanation below for more info.

http://seqanswers.com/forums/showthread.php?t=2301

The FLAG field is normally combined with other attributes when parsing a read, for example to tell whether the read 
is mapped or whether it is paired.

The SAM flag field, although it appears as a single number, actually contains several pieces of information which
have been combined together. It is a bitwise field, which means that it makes use of the way that computers 
represent numbers to store several small values in one large value.

If you think of a standard integer as being composed of 32 bits (0 or 1) then it would look like:

00000000000000000000000000000000   (note: the 'first' bit is the right most value)

However, SAM uses this single number as a series of boolean (true/false) flags, where each position in the array of 
bits represents a different sequence attribute:

Bit 0 = The read was part of a pair during sequencing
Bit 1 = The read is mapped in a pair
Bit 2 = The query sequence is unmapped
Bit 3 = The mate is unmapped
Bit 4 = Strand of query (0=forward 1=reverse)

etc.

Constructing the value from the individual flags is fairly easy. If the flag is false, don't add anything to the 
total. If it's true, add 2 raised to the power of the bit position.

For example:

11010

Bit 0 - false - add nothing
Bit 1 - true - add 2**1 = 2
Bit 2 - false - add nothing
Bit 3 - true - add 2**3 = 8
Bit 4 - true - add 2**4 = 16

Bit pattern = 11010 = 16+8+2 = 26

So the flag value would be 26.
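The same arithmetic as a short Python sketch (the function name flag_from_bits is made up for this illustration):

```python
def flag_from_bits(bit_positions):
    """Add 2 raised to the bit position for every flag that is true."""
    # set() guards against counting a repeated bit position twice
    return sum(2 ** position for position in set(bit_positions))


# Bits 1, 3 and 4 are true, as in the example above
flag = flag_from_bits([1, 3, 4])
print(flag)               # 26
print(format(flag, "b"))  # 11010
```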

To extract the individual flags from the compound value you can use a logical AND operation. This will tell you if
 a specific bit in the compound value is true or not. The exact syntax will depend on the language you're using, 
but in Perl for instance you could do:

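my $flag = 16;  # example value; in practice, read it from the FLAG column

# Bit 4 (2**4 = 16) is set when the read maps to the reverse strand
if ($flag & 16) {
  print "Reverse\n";
} else {
  print "Forward\n";
}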


This extracts the information from bit 4 (value 2**4 = 16).

To view human-readable FLAGs, use "samtools view -X".

http://biostar.stackexchange.com/questions/7397/in-sam-format-clarify-the-meaning-of-the-0-flag

When the flag field is 0, it means none of the bitwise flags specified in the SAM spec (on page 4) are set. That 
means that your reads with flag 0 are unpaired (because the first flag, 0x1, is not set), successfully mapped to 
the reference (because 0x4 is not set) and mapped to the forward strand (because 0x10 is not set).


Summarizing your data, the reads with flag 4 are unmapped, the reads with flag 0 are mapped to the forward strand 
and the reads with flag 16 are mapped to the reverse strand.

Optional fields

Check this pdf.

Tag 	Meaning
NM 	Edit distance
MD 	Mismatching positions/bases
AS 	Alignment score
X0 	Number of best hits
X1 	Number of suboptimal hits found by BWA
XN 	Number of ambiguous bases in the reference
XM 	Number of mismatches in the alignment
XO 	Number of gap opens
XG 	Number of gap extensions
XT 	Type: Unique/Repeat/N/Mate-sw
XA 	Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS 	Suboptimal alignment score
XF 	Support from forward/reverse alignment
XE 	Number of supporting seeds
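
In an alignment line, optional fields are written as TAG:TYPE:VALUE (e.g. NM:i:1 and RG:Z:L1 in the example at the top of this page). The Python sketch below shows one way to read them. The function name parse_optional_field is an assumption of this example, and only the integer (i) and float (f) types are converted; all other types are kept as text.

```python
def parse_optional_field(field):
    """Split 'TAG:TYPE:VALUE' and convert numeric values."""
    # Split at most twice: Z-type values such as XA hits may contain ':'
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    return tag, value


# Optional fields of the first read in the example
tags = dict(parse_optional_field(field) for field in ["NM:i:1", "RG:Z:L1"])
print(tags["NM"], tags["RG"])
```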
  1. https://broadinstitute.github.io/picard/explain-flags.html