.sam
Source 1
Source 2 (pdf)
What is SAM
SAM (Sequence Alignment/Map) is a generic alignment format for representing next-generation sequencing (NGS) reads mapped to a reference genome. It supports short and long reads and different sequencing platforms, is flexible in style, compact in size, and efficient in random access.
The SAM format consists of a header section (~metadata) and an alignment section (raw mapping information). The whole header section can be absent, but keeping the header is recommended.
BAM is the binary representation of SAM. It is more compact and faster to access.
Contents
Example
@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
@ - header section
Although the header is optional, it must follow a fixed layout when present. Each header line starts with @, followed by a two-letter record code (HD, SQ, RG, ...). After this code come tag:value pairs (e.g. SN:chr20). If SQ records are present in the header, every RNAME (reference name, see alignment section) and MRNM (mate reference sequence name, see alignment section) must appear in an SQ header record.
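The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: it assumes fields are tab-separated, as in real SAM files, and the function names `parse_header_line` and `reference_names` are made up for this example.

```python
def parse_header_line(line):
    """Split one '@'-prefixed header line into (record_code, {tag: value})."""
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    parts = line.rstrip("\n").split("\t")
    code = parts[0][1:]                      # e.g. 'SQ' from '@SQ'
    fields = {}
    for item in parts[1:]:
        tag, _, value = item.partition(":")  # split on the first ':' only
        fields[tag] = value
    return code, fields


def reference_names(header_lines):
    """Collect the SN values of all @SQ records (hypothetical helper)."""
    parsed = (parse_header_line(line) for line in header_lines)
    return {fields["SN"] for code, fields in parsed if code == "SQ"}


header = [
    "@HD\tVN:1.0",
    "@SQ\tSN:chr20\tLN:62435964",
    "@RG\tID:L1\tPU:SC_1_10\tLB:SC_1\tSM:NA12891",
]
known = reference_names(header)
print(known)  # {'chr20'}
```

With `known` in hand, a reader could check that the RNAME and MRNM of each alignment line appear among the SQ records.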
alignment section
Mandatory fields
- QNAME: Query name of the read or the read pair
- FLAG: Bitwise flag (pairing, strand, mate strand, etc.)
- RNAME: Reference sequence name
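- POS: 1-based leftmost position of clipped alignment
- MAPQ: Mapping quality (Phred-scaled); a value of 0 usually indicates an ambiguous placement rather than an unmapped read (unmapped reads are marked by FLAG bit 4), and 255 means the quality is not available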
- CIGAR: Extended CIGAR string (operations: MIDNSHP)
- MRNM: Mate reference name (‘=’ if same as RNAME)
- MPOS: 1-based leftmost mate position
- ISIZE: Inferred insert size
- SEQ: Query sequence on the same strand as the reference
- QUAL: Query quality (ASCII-33=Phred base quality)
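To make the field list concrete, here is a hedged Python sketch that splits one alignment line into the eleven mandatory fields, breaks the CIGAR string into operations, and converts QUAL characters to Phred scores. It assumes tab-separated input and only the CIGAR operations listed above (MIDNSHP). The names `Alignment`, `parse_alignment`, `parse_cigar` and `phred_scores` are invented for this example.

```python
import re
from typing import NamedTuple


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: str
    optional: list  # raw optional fields, left unparsed here


def parse_alignment(line):
    """Split a tab-separated alignment line into its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return Alignment(cols[0], int(cols[1]), cols[2], int(cols[3]),
                     int(cols[4]), cols[5], cols[6], int(cols[7]),
                     int(cols[8]), cols[9], cols[10], cols[11:])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar):
    """Turn e.g. '10M1D25M' into [(10, 'M'), (1, 'D'), (25, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def phred_scores(qual):
    """QUAL stores Phred quality + 33 as ASCII, so subtract 33."""
    return [ord(char) - 33 for char in qual]


line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195", "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "NM:i:1", "RG:Z:L1",
])
aln = parse_alignment(line)
print(aln.rname, aln.pos, parse_cigar(aln.cigar))
print(phred_scores(aln.qual)[:3])  # [27, 27, 27]
```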
FLAG field
The FLAG field is important for determining the nature of the alignment of the read to the reference sequence. This number represents a certain combination of features of the alignment. The flag field consists of the 11 bits listed below, each of which records the absence (0) or presence (1) of a feature.
- Bit 0 (1) = The read is one of two or more reads coming from the same fragment during sequencing (e.g. a pair)
- Bit 1 (2) = All mate reads are mapped
- Bit 2 (4) = The query sequence is unmapped
- Bit 3 (8) = The next mate read is unmapped
- Bit 4 (16) = Strand of query (0=forward 1=reverse)
- Bit 5 (32) = Strand of the next mate read
- Bit 6 (64) = The first read of the mate reads
- Bit 7 (128) = The last read of the mate reads
- Bit 8 (256) = Secondary alignment
- Bit 9 (512) = Read fails quality checks
- Bit 10 (1024) = Read is PCR or optical duplicate
So while the features are stored bitwise, the FLAG field reports them as the single decimal number that these bits represent. Unfortunately, this number is still very user-unfriendly to read. Luckily, there are tools to convert this number into a feature list, like this one: https://broadinstitute.github.io/picard/explain-flags.html. We see for example that a FLAG value of 16 means that the read is mapped on the reverse strand.
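To convert a number to the alignment information, you can also check out this perl script and the ASCII table. To view the complete SAM file with human-readable FLAGs, use "samtools view -X" (this meaning of -X applies to older samtools releases; newer releases provide the "samtools flags" command for explaining flag values).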
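As a rough stand-in for such a converter, the following Python sketch turns a FLAG value into a list of feature descriptions. The bit values come from the list above; the short descriptions and the name `explain_flag` are this example's own wording, not an official vocabulary.

```python
# (bit value, description) pairs; descriptions paraphrase the list above.
FLAG_BITS = [
    (1, "read is paired"),
    (2, "all mate reads are mapped"),
    (4, "query is unmapped"),
    (8, "next mate is unmapped"),
    (16, "read on reverse strand"),
    (32, "next mate on reverse strand"),
    (64, "first read of the mate reads"),
    (128, "last read of the mate reads"),
    (256, "secondary alignment"),
    (512, "read fails quality checks"),
    (1024, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """Return the description of every bit that is set in flag."""
    return [text for value, text in FLAG_BITS if flag & value]


for flag in (16, 99, 147):
    print(flag, explain_flag(flag))
```

The values 99 and 147 are the flags of the two reads in the example at the top of this page: 99 = 64 + 32 + 2 + 1 and 147 = 128 + 16 + 2 + 1.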
Look at the explanation below for more info.
http://seqanswers.com/forums/showthread.php?t=2301
The FLAG field is normally combined with other attributes when parsing a read, for example to check whether the read is mapped or whether it is paired.

The SAM flag field, although it appears as a single number, actually contains several pieces of information which have been combined together. It is a bitwise field, which means that it makes use of the way that computers represent numbers to store several small values in one large value. If you think of a standard integer as being composed of 32 bits (0 or 1) then it would look like:

00000000000000000000000000000000

(note: the 'first' bit is the rightmost value)

However, SAM uses this single number as a series of boolean (true/false) flags, where each position in the array of bits represents a different sequence attribute:

- Bit 0 = The read was part of a pair during sequencing
- Bit 1 = The read is mapped in a pair
- Bit 2 = The query sequence is unmapped
- Bit 3 = The mate is unmapped
- Bit 4 = Strand of query (0=forward 1=reverse)
- etc.

Constructing the value from the individual flags is fairly easy. If the flag is false, don't add anything to the total. If it's true, add 2 raised to the power of the bit position. For example, for the bit pattern 11010:

- Bit 0 - false - add nothing
- Bit 1 - true - add 2**1 = 2
- Bit 2 - false - add nothing
- Bit 3 - true - add 2**3 = 8
- Bit 4 - true - add 2**4 = 16

Bit pattern = 11010 = 16+8+2 = 26

So the flag value would be 26.

To extract the individual flags from the compound value you can use a logical AND operation. This will tell you if a specific bit in the compound value is true or not. The exact syntax will depend on the language you're using, but in Perl for instance you could do:

if ($flag & 16) { print "Reverse"; } else { print "Forward"; }

This extracts the information from bit 4 (therefore 2**4 = 16). To view human-readable FLAGs, use "samtools view -X".
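The same arithmetic can be written in Python. This is only a sketch of the construction and extraction steps described above; `flag_from_bits` and `bit_is_set` are hypothetical helper names.

```python
def flag_from_bits(bit_positions):
    """Add 2**position for every bit that is true."""
    return sum(1 << position for position in set(bit_positions))


def bit_is_set(flag, position):
    """Test one bit of the compound value with a bitwise AND."""
    return (flag & (1 << position)) != 0


# Bits 1, 3 and 4 are true, bits 0 and 2 are false: pattern 11010.
flag = flag_from_bits([1, 3, 4])
print(flag, bin(flag))  # 26 0b11010

# Python equivalent of the Perl strand test (bit 4, value 16).
strand = "Reverse" if flag & 16 else "Forward"
print(strand)  # Reverse
```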
http://biostar.stackexchange.com/questions/7397/in-sam-format-clarify-the-meaning-of-the-0-flag
When the flag field is 0, it means none of the bitwise flags specified in the SAM spec (on page 4) are set. That means that your reads with flag 0 are unpaired (because the first flag, 0x1, is not set), successfully mapped to the reference (because 0x4 is not set) and mapped to the forward strand (because 0x10 is not set). Summarizing your data, the reads with flag 4 are unmapped, the reads with flag 0 are mapped to the forward strand and the reads with flag 16 are mapped to the reverse strand.
Optional fields
Check this pdf.
Tag   Meaning
NM    Edit distance
MD    Mismatching positions/bases
AS    Alignment score
X0    Number of best hits
X1    Number of suboptimal hits found by BWA
XN    Number of ambiguous bases in the reference
XM    Number of mismatches in the alignment
XO    Number of gap opens
XG    Number of gap extensions
XT    Type: Unique/Repeat/N/Mate-sw
XA    Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS    Suboptimal alignment score
XF    Support from forward/reverse alignment
XE    Number of supporting seeds
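Optional fields appear after the mandatory columns in TAG:TYPE:VALUE form, as in NM:i:1 and RG:Z:L1 in the example at the top of this page. A minimal Python sketch for reading them follows. It converts only the integer (i) and float (f) types and leaves every other type as a string, and the names `parse_tag` and `parse_tags` are assumptions of this example.

```python
def parse_tag(field):
    """Split 'TAG:TYPE:VALUE' into (tag, converted value)."""
    # maxsplit=2 keeps any ':' inside the value intact.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    return tag, value  # other types are kept as strings in this sketch


def parse_tags(fields):
    """Build a {tag: value} dict from a list of optional fields."""
    return dict(parse_tag(field) for field in fields)


tags = parse_tags(["NM:i:1", "RG:Z:L1"])
print(tags)  # {'NM': 1, 'RG': 'L1'}
```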