Samtools

From BITS wiki
Jump to: navigation, search

The canonical BAM/SAM/CRAM data framework

SimilarTo.png: Picard, BamUtils, NGSUtils

Suggests.pngsuggests : bcftools


[ BioWare | Main_Page ]


Samtools[1][2] is the standard library to work on SAM, BAM (and in the future CRAM) files. A Samtools Java wrapper is available under the name Picard.

Samtools's versions

Samtools and its companion bcftools are in constant evolution as well as related apps like vcftools. This is due to the increasing need for speed and complex analysis triggered by the ever growing NGS community.

Consequently, two versions of Samtools are currently coexisting; the current version 1.x using the htslib engine which should be more performant and the older and classical version 0.x which is still used by many packages (tophat).

NEW official page for Samtools and relatives

GitHub pages with valid information and dev-versions

 

The new htslib version of Samtools 1.3.x

The current version 1.3.x of samtools was re-written to use the htslib engine and is faster and more adapted to high volume data. It also supports CRAM format and more http://www.htslib.org/[4].

Please consider migrating to samtools 1.3.x. Most commands remain unchanged except for here and there an additional argument or command detail. The transition should be painless if you can spend some time finding internet examples.

• samtools 1.3.1, the new (htslib) version of samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

## the accompanying perl tools 'plot-bamstats' and 'plot-vcfstats' are present to create gnuplot pictures from the samtools output data.
# additional perl scripts are 'blast2sam.pl export2sam.pl psl2sam.pl seq_cache_populate.pl wgsim_eval.pl
# bowtie2sam.pl interpolate_sam.pl sam2vcf.pl soap2sam.pl zoom2sam.pl color-chrs.pl novo2sam.pl samtools.pl vcfutils.pl'

• bgzip 1.3.1, the new (htslib) version of bgzip

Version: 1.3.1
Usage:   bgzip [OPTIONS] [FILE] ...
Options:
   -b, --offset INT        decompress at virtual file pointer (0-based uncompressed offset)
   -c, --stdout            write on standard output, keep original files unchanged
   -d, --decompress        decompress
   -f, --force             overwrite files without asking
   -h, --help              give this help
   -i, --index             compress and create BGZF index
   -I, --index-name FILE   name of BGZF index file [file.gz.gzi]
   -r, --reindex           (re)index compressed file
   -s, --size INT          decompress INT bytes (uncompressed size)
   -@, --threads INT       number of compression threads to use [1]

• tabix 1.3.1, the new (htslib) version of tabix

Version: 1.3.1
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]

Indexing Options:
   -0, --zero-based           coordinates are zero-based
   -b, --begin INT            column number for region start [4]
   -c, --comment CHAR         skip comment lines starting with CHAR [null]
   -C, --csi                  generate CSI index for VCF (default is TBI)
   -e, --end INT              column number for region end (if no end, set INT to -b) [5]
   -f, --force                overwrite existing index without asking
   -m, --min-shift INT        set minimal interval size for CSI indices to 2^INT [14]
   -p, --preset STR           gff, bed, sam, vcf
   -s, --sequence INT         column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT       skip first INT lines [0]

Querying and other options:
   -h, --print-header         print also the header lines
   -H, --only-header          print only the header lines
   -l, --list-chroms          list chromosome names
   -r, --reheader FILE        replace the header with the content of FILE
   -R, --regions FILE         restrict to regions listed in the file
   -T, --targets FILE         similar to -R but streams rather than index-jumps

 

The 'OLD version' of Samtools with embeded bcftools

samtools 0.1.19

Handicon.png samtools_0.1.19


commands in samtools version 0.1.19

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19-44428cd

Usage:   samtools <command> [options]

Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         bedcov      read depth per BED region
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
         bamshuf     shuffle and group alignments by name

the samtools bcftools built-in command

Handicon.png bcftools_0.1.19

Program: bcftools (Tools for data in the VCF/BCF formats)
Version: 0.1.19-44428cd

Usage:   bcftools <command> <arguments>

Command: view      print, extract, convert and call SNPs from BCF
         index     index BCF
         cat       concatenate BCFs
         ld        compute all-pair r^2
         ldpair    compute r^2 between requested pairs

example of cleaning, sorting, merging BAM files

I was personally helped a lot by Dave Tang's wiki page [5] reporting example commands for samtools:

eg. filter unmapped reads from 'bad.bam' to a new 'cleaned.bam', sort, and index

samtools view -h -F 4 -b bad.bam > tmp.bam
samtools sort tmp.bam cleaned
samtools index cleaned.bam && rm bad.bam tmp.bam

# merge multiple BAM files into one sorted file
# from http://www.biostars.org/p/9864/
find $BAM_DIR -name '*.bam' | {
    read firstbam
    samtools view -h "$firstbam"
    while read bam; do
        samtools view "$bam"
    done
} | samtools view -ubS - | samtools sort - merged
samtools index merged.bam
ls -l merged.bam merged.bam.bai

example of calling variants with the 'stable' version

# input should be a 'coordinate' sorted and indexed BAM file

ref=your-reference-genome.fasta; # should be indexed with samtools fa2idx
bamfile = your-sorted-and-indexed.bam

# pileup reads, generate raw BCF data, and call variants (SNVs and indels)
## samtools mpileup parameters
## -u generate uncompress BCF output
## bcftools view parameters
## -b output BCF instead of VCF
## -v output potential variant sites only (force -c)
## -c SNP calling (force -e)
## -e likelihood based analyses
## -g call genotypes at variant sites (force -c)
### as seen above several options forces other options
samtools_0.1.19 mpileup -uf ${ref} \
   ${bamfile} | bcftools_0.1.19 view -bvceg - > rawcalls.bcf

# parse the raw BCF data and filter out very-high coverage calls
## such calls likely originate from CNV compression in the reference genome
## -D1000 filters out calls with more than 1000 piled-up reads
### a common practice is to adjust -D to 20-30x your average coverage depth
bcftools_0.1.19 view rawcalls.bcf | \
   vcfutils_0.1.19.pl varFilter -D1000 > filtered-calls.vcf

# compress and index VCF calls
bgzip -c filtered-calls.vcf > filtered-calls.vcf.gz && \
   tabix -p vcf filtered-calls.vcf.gz



References:
  1. Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, 1000 Genome Project Data Processing Subgroup
    The Sequence Alignment/Map format and SAMtools.
    Bioinformatics: 2009, 25(16);2078-9
    [PubMed:19505943] ##WORLDCAT## [DOI] (I p)

  2. http://samtools.sourceforge.net
  3. http://samtools.github.io/bcftools/bcftools.html
  4. http://www.htslib.org/
  5. http://davetang.org/wiki/tiki-index.php?page=SAMTools



[ BioWare | Main_Page ]