NGS-formats

From BITS wiki
Jump to: navigation, search


Understanding and mastering NGS data formats is key to any analysis. You will spend some time learning how to do this and often will need Dr Google to find back how exactly you should access your data. The code below gives a general introduction to what you will encounter.


[ Main_Page | NGS_data_analysis | Hands-on introduction to NGS variant analysis |
| NGS Exercise.1 ]


The NGS format trinity

While fastQ is still the default format to store NGS reads, BAM became the default format to store read and mapping information after alignment of a read to a reference genome. Finally, VCF (variant call format) was recently developped during the 1000 genome project, and became de facto the standard variant manipulation format. (FastQ and VCF will be detailed later today)

Technical.png CRAM is at the corner of the street! This 'new' / 'improved' BAM format may be adopted for its increased compactness. Read more about CRAM here

FASTQ format

In short, fastQ provides two lines of data and a header line (that starts with '@' and can be repeated or duplicated by a line containing starting by '+'. The header contains a unique identifier (name) and important information about the sequencer origin of the current read. A correct fastQ record is a 4-line block reads as the following. The Base call quality scores are represented by ASCCI characters. If you subtract 33 from the character ascii code, you obtain the exponent of confidence score of the call for a given nucleotide (Q=10^-<ascii-33>).

A pair of reads from the paired end library looks like this (note the /1 and /2 suffix)

@EAS51_0210:3:6:3797:7459/1
GAATCCAACCCTCACAAAGAAGTTTCTCAGAATTCTTCCATCGAATTTTTATGTGATGGTATTTCCTTTTTTACCATAGGCCTCAAAGCGCTCCAAATAT
+
GGGGGGGGGGGGGFGGEGGBGGGGGGFGGGEFGFGGGGFFGGGFDGGGGGGGGEGGFGCEEGFFFFGGGGEGEEBDDEE@GGEGFBEEGEEEEEEB@CDD
@EAS51_0210:3:6:3797:7459/2
AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT
+
GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC

BAM format

The BAM format no more than a binary representation of the SAM format that is more compact and allows fast searching after indexing. BAM reports the address where the reads aligned best, followed by a score illustrating the qualit of mapping, the read sequence (as in the fastQ original file, and a text string of the same length as the sequence reporting the base-call quality for each nucleotide. Finally, the BAM format reports a so called CIGAR string that describes any mismatch between the read sequence and the reference genome at the alignment address. It is not necessary to read BAM fluently but it is strongly recommended to read the SAM format specifications from type to time to discover new features and refresh your memory.

The latest specification document is maintained by Heng Li (father of samtools, bwa, and other widely used tools) and can be found at:

a typical SAM/BAM header

one @HD header line describing the SAM version @SQ sequence dictionary lines describing each chromosome in the reference genome in a chosen sorting order @PG line to describe software used to generate the read group

Here is the header from today's initial data that contains only three distinct header types which is far from optimal.

samtools view -H NA18507_GAIIx_100_chr21.bam
@HD     VN:1.3  SO:coordinate
@PG     ID:CASAVA       VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
@SQ     SN:chr3 LN:199501827
@SQ     SN:chr4 LN:191273063
@SQ     SN:chr5 LN:180857866
@SQ     SN:chr6 LN:170899992
@SQ     SN:chr7 LN:158821424
@SQ     SN:chrX LN:154913754
@SQ     SN:chr8 LN:146274826
@SQ     SN:chr9 LN:140273252
@SQ     SN:chr10        LN:135374737
@SQ     SN:chr11        LN:134452384
@SQ     SN:chr12        LN:132349534
@SQ     SN:chr13        LN:114142980
@SQ     SN:chr14        LN:106368585
@SQ     SN:chr15        LN:100338915
@SQ     SN:chr16        LN:88827254
@SQ     SN:chr17        LN:78774742
@SQ     SN:chr18        LN:76117153
@SQ     SN:chr19        LN:63811651
@SQ     SN:chr20        LN:62435964
@SQ     SN:chrY LN:57772954
@SQ     SN:chr22        LN:49691432
@SQ     SN:chr21        LN:46944323
@SQ     SN:chrM LN:16571

the first 5 data lines from the same BAM file

samtools view NA18507_GAIIx_100_chr21.bam | head -5
EAS51_0210:3:6:3797:7459        165     chr21   9719702 255     *       *       0       0       AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT    GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC    H0:i:0  H1:i:0  H2:i:2  SM:i:-1 AS:i:0
EAS51_0210:3:6:3797:7459        89      chr21   9719702 73      100M    *       0       0       ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC    DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34       SM:i:73 AS:i:0
EAS51_0210:7:33:5109:13959      145     chr21   9719707 254     100M    chrY    10653706        0       TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC    FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2  SM:i:461        AS:i:0
EAS25_0078:8:23:14907:11377     165     chr21   9719708 255     *       *       0       0       CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF    H0:i:1  H1:i:6  H2:i:60 SM:i:-1 AS:i:0
EAS25_0078:8:23:14907:11377     89      chr21   9719708 254     100M    *       0       0       TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA    EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG    XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3   SM:i:461        AS:i:0

the meaning of the SAM columns is explained below

Col     Field   Type    Regexp/Range                    Brief description
1       QNAME   String  [!-?A-~]{1,255}                 Query  template NAME
2       FLAG    Int     [0,216 -1]                      bitwise FLAG
3       RNAME   String  \*|[!-()+-<>-~][!-~]*           Reference  sequence  NAME
4       POS     Int     [0,231 -1]      1-based         leftmost mapping POSition
5       MAPQ    Int     [0,28 -1]                       MAPping Quality
6       CIGAR   String  \*|([0-9]+[MIDNSHPX=])+         CIGAR string
7       RNEXT   String  \*|=|[!-()+-<>-~][!-~]*         Ref.  name  of the  mate/next read
8       PNEXT   Int     [0,231 -1]                      Position of the  mate/next read
9       TLEN    Int     [-231 +1,231 -1]                observed  Template LENgth
10      SEQ     String  \*|[A-Za-z=.]+                  segment SEQuence
11      QUAL    String  [!-~]+                          ASCII of Phred-scaled base  QUALity+33

Handicon.png 'QNAME' is the name taken from the original fastQ record while fields 'SEQ' and 'QUAL' reproduce the sequence content of the original fastQ

  • look at the 5' end of the first bam sequence and compare it to the two fastQ records above!
  • look at the 3' end of the second bam sequence and compare it to the two fastQ records above!

what did you find?

Try it by yourself before expanding on the right!

  • the first bam record is identical to the second fastq record (AACCTTTGTTT...)
  • the second bam record is identical to the first fastQ record after reverse-complementing it (...GGTTGGATTC-3' => (5'-GAATCCAACC...). This is because of the forward-reverse library structure.

For more information about read orientation, refer to the following IGV help page: http://www.broadinstitute.org/software/igv/interpreting_pair_orientations[3]


readpairorientations.jpg


the special FLAG field#2

This field contains aggregated binary information that can be deciphered online at http://picard.sourceforge.net/explain-flags.html[4]

BED format

BED format is often used to store coordinate based information about genes and genomic features. More informaton on this format and its relatives can be found on this wiki '.bed' or by asking Dr G.

BED files can be compressed with bgzip for rapid retrieval.

VCF format

[ The Variant Call Format (VCF) specified the format of a text file used in bioinformatics for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome.] (read more on the wikipedia page (http://en.wikipedia.org/wiki/Variant_Call_Format[5]).

The main tools able to manipulate VCF files in vcftools; see home page with a lot of information http://vcftools.sourceforge.net [6]

Specification information can be found in the latest content version (needs building): https://github.com/samtools/hts-specs [2]

Below are two figures from a VCFtools poster (http://vcftools.sourceforge.net/VCF-poster.pdf[7]) that show typical VCF calls in two genome samples and explain them.


vcf_fig1.png
vcf_fig2.png

Compressing NGS data, creating indexes, and extracting subregions

Although FASTA, SAM, VCF, and BED can be read using a text processor, it becomes rapidely challenging to find or extract a sequence or row out of the multitude of lines contained in these NGS files. The solution to the problem is to compress the files (often after sorting them in coordinate order) and create an 'index' that can be used for fast and random access retrieval of data.

Handicon.png More examples in our exercise 9

FASTA data

extract FASTA subsequences with samtools faidx

the samtools faidx command parameters

Usage:   samtools faidx <file.fa|file.fa.gz> [<reg> [...]]

compress FASTA data with razip and bgzip

NGS data needs be compressed in order to reduce file size and access data without having to proceed through the whole file (tanks to the index). Several compression programs exists that have been used with NGS data. The older one is razip which tends to be now replaced by the improved bgzip, but not for all programs.

comment by Heng Li[8]

Picard does not support random access in a razip'ed file and it is quite complicated to implement the razip method in Java as razip has to access some low-level zlib APIs. Given the substantial amount of work, probably razip will never get implemented in Java, which means Picard will never read razip'ed fasta files. The key difference between razip and bgzip is that razip uses the real file offset while bgzip uses the virtual file offset. Razip is slightly older than bgzip. The very first version of samtools (not public) used razip instead of bgzip. We switched to bgzip/BGZF because it is simpler and much easier to reimplement in Java. It is actually possible to extend BGZF to make it support seeking with the real file offset, but this has not been done.

the razip command parameters

Usage:   razip [options] [file] ...
 
Options: -c      write on standard output, keep original files unchanged
         -d      decompress
         -l      list compressed file contents
         -b INT  decompress at INT position in the uncompressed file
         -s INT  decompress INT bytes in the uncompressed file
         -h      give this help
infile=ref/HiSeq_UCSC_hg19.fa
 
razip -c ${infile} > ${infilez}.razip
 
# after compression, the index can be created with 'samtools faidx'
samtools faidx ${infilez}.razip
 
# the index is a table of positions and lengths for each chromosome as seen here
cat ${infilez}.razip.fai
 
chrM	16571	6	70	71
chr1	249250621	16820	70	71
chr2	243199373	252828171	70	71
chr3	198022430	499501827	70	71
chr4	191154276	700353155	70	71
chr5	180915260	894238213	70	71
chr6	171115067	1077737983	70	71
chr7	159138663	1251297557	70	71
chr8	146364022	1412709636	70	71
chr9	141213431	1561164579	70	71
chr10	135534747	1704395352	70	71
chr11	135006516	1841866317	70	71
chr12	133851895	1978801505	70	71
chr13	115169878	2114565577	70	71
chr14	107349540	2231380746	70	71
chr15	102531392	2340263858	70	71
chr16	90354753	2444259992	70	71
chr17	81195210	2535905535	70	71
chr18	78077248	2618260684	70	71
chr19	59128983	2697453329	70	71
chr20	63025520	2757427019	70	71
chr21	48129895	2821352911	70	71
chr22	51304566	2870170383	70	71
chrX	155270560	2922207878	70	71
chrY	59373566	3079696595	70	71

extracting FASTA data from a local file

# point samtools to a local FASTA file indexed with samtols faidx
 
# the file can be standard text FASTA
infile=ref/HiSeq_UCSC_hg19.fa
 
# or can have been compressed with '''razip''' or '''bgzip'''
## compression with 'zip' of 'gzip' are not supported
infilez=ref/HiSeq_UCSC_hg19.fa.razip
 
## if it does not exist on your path, you can
## create the archived form with:
## bgzip -c ${infile} > ${infilez}
 
time samtools faidx \
    ${infile}\
    chr1:1,000,000-1,001,000 \
    chr2:2,000,000-2,001,000
 
time samtools faidx \
    ${infilez}\
    chr1:1,000,000-1,001,000 \
    chr2:2,000,000-2,001,000
 
# result in both cases is the same
>1:1,000,000-1,001,000
TGGGCACAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAGACCCCCT
GCCAGGGAGACCCCAGGCGCCTGAATGGCCACGGGAAGGAAAACCTACCAGCCCCTCCGT
GTGTCCTCCTGGCACATGGCGACCTCCATGACCCGACGAGGGTGCGGGGCCCGGGGCAGG
GTGGCCAGGTGCGGGGGTGCGGGGCCCGGGGCAGCTGCCCTCGGTGGGAGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGGTGGGGTCTGCGGGGCCCTGGGGGTGTTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGTGCCCTCGGGGGGTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGGGGGGCCCTAAGCTTAGATGCAGGTC
TCTCCCTGGCAGCCCCTCAAGGCCACGAGGATCAGTGCTCGGAGCCTGGAGGGCTGTGTG
CAGGAGTAGCAGGGCCACTGATGCCAGCGGGAAGGCCAGGCAGGGCTTCTGGGTGGAGTT
CAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGC
CCGATGGGAGCGCAGGGTCTGGGACTGTAGGGTCCAGCCCTACGGAGCTTAGCAGGTGTT
CTCCCCGTGTGTGGAGATGAGAGATTGTAATAAATAAAGAC
>2:2,000,000-2,001,000
AACCTAACCGTCTACAGTGGGGGCCGCCCTGCCTCAGTGTGGACTGTACAGAACCGAGTG
TAGACAGGCCACCTTTACCTACTGAGTGAGGGCCGTCTTGCCTCAGTGTGGGCTGTACAG
AACCGAGTGTAGACGGGCCACCTTTACCTAGTGAGTGAGGGCCGTCTTGCCTCAGTGTGG
ACTGCACAGAACCGAGTATAGATGGGCCGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCC
TCAGTGTGGGCTGCACAGAGCCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGC
CGCCCTGCCTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTG
AGTGAGGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTAGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGATGGGCC
GCCTTTACCTAGTGAGTGAGGGCCACCCTACCTCAGTGTGGGCTGCACAGAACCGAGTGT
AGACGGGCCGCCTTTACCTAGTGAGCGAGGGCCGCCCTGCCTCAGTGTGGGCTGCACAGA
ACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGCCGCCCTGCTTCAGTGTGGA
CTGCATGGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCCT
CAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGGGCCGCCC
TGCTTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGA
GGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTTTACCT
AGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTTACAGGAAGTCTTCTCTCCTAGGAACGGAGGCAGCG
 
# adding time before the command returned:
## local uncompressed version
real	0m0.002s
user	0m0.001s
sys	0m0.001s
 
## local razip-compressed version
real	0m0.002s
user	0m0.000s
sys	0m0.002s
 
## the extraction time is identical 
## but the file was reduced from 3Gb to 845Mb
 
## note that not all tools read razip-archived data
## to use compressed FASTA with such tools
## decompress the data to a stream and pipe it to the command
 
fa_archive=some.fasta.razip
gzip -dc ${fa_archive} | some-command

extracting FASTA data from a WEB file

# point samtools to a razip-compressed and indexed fasta file on the internet and ask for 2 subsequences provided as coordinates
samtools faidx \
    ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz \
    1:1,000,000-1,001,000 \
    2:2,000,000-2,001,000
 
## the command will first download the index from the web-server to the local folder
## 'hs37d5.fa.gz.fai' (2.8K)
## then use it to retrieve subsequence(s)
 
## you therefore nee write access to the palce where you run the command
 
# result
>1:1,000,000-1,001,000
TGGGCACAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAGACCCCCT
GCCAGGGAGACCCCAGGCGCCTGAATGGCCACGGGAAGGAAAACCTACCAGCCCCTCCGT
GTGTCCTCCTGGCACATGGCGACCTCCATGACCCGACGAGGGTGCGGGGCCCGGGGCAGG
GTGGCCAGGTGCGGGGGTGCGGGGCCCGGGGCAGCTGCCCTCGGTGGGAGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGGTGGGGTCTGCGGGGCCCTGGGGGTGTTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGTGCCCTCGGGGGGTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGGGGGGCCCTAAGCTTAGATGCAGGTC
TCTCCCTGGCAGCCCCTCAAGGCCACGAGGATCAGTGCTCGGAGCCTGGAGGGCTGTGTG
CAGGAGTAGCAGGGCCACTGATGCCAGCGGGAAGGCCAGGCAGGGCTTCTGGGTGGAGTT
CAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGC
CCGATGGGAGCGCAGGGTCTGGGACTGTAGGGTCCAGCCCTACGGAGCTTAGCAGGTGTT
CTCCCCGTGTGTGGAGATGAGAGATTGTAATAAATAAAGAC
>2:2,000,000-2,001,000
AACCTAACCGTCTACAGTGGGGGCCGCCCTGCCTCAGTGTGGACTGTACAGAACCGAGTG
TAGACAGGCCACCTTTACCTACTGAGTGAGGGCCGTCTTGCCTCAGTGTGGGCTGTACAG
AACCGAGTGTAGACGGGCCACCTTTACCTAGTGAGTGAGGGCCGTCTTGCCTCAGTGTGG
ACTGCACAGAACCGAGTATAGATGGGCCGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCC
TCAGTGTGGGCTGCACAGAGCCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGC
CGCCCTGCCTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTG
AGTGAGGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTAGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGATGGGCC
GCCTTTACCTAGTGAGTGAGGGCCACCCTACCTCAGTGTGGGCTGCACAGAACCGAGTGT
AGACGGGCCGCCTTTACCTAGTGAGCGAGGGCCGCCCTGCCTCAGTGTGGGCTGCACAGA
ACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGCCGCCCTGCTTCAGTGTGGA
CTGCATGGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCCT
CAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGGGCCGCCC
TGCTTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGA
GGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTTTACCT
AGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTTACAGGAAGTCTTCTCTCCTAGGAACGGAGGCAGCG
 
## @adding time before the command reports:
real	0m9.415s
user	0m0.005s
sys	0m0.007s
 
## this seems long as compared to the local run above !!
## but consider that you did not need to download the full data first 
## which could take up to 30min with a slow internet.

SAM | BAM

Compress SAM <to> BAM

Samtools used below is not limited to the view command. Please refer to the samtools page for more information[9]

the samtools view command parameters

Usage: samtools view [options] <in.bam>
# SAM to BAM
samtools view -bS aln.sam > aln.bam 
 
# BAM to SAM without header
samtools view sample.bam > sample.sam
 
# extract a region (but no header)
samtools view sample_sorted.bam "chr1:10-13"
 
# header also added to the output
samtools view -h sample_sorted.bam "chr1:10-13"
 
# header added and result stored in bam
samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam

Tabular coordinate-based data

compress tabulat formats with bgzip

Tabular formats are frequents in NGS. They include GTF, GFF, BED, VCF, and more. Such files become often large and it is very convenient to archive them and allow rapid access through indexing.

the bgzip command parameters

Usage:   bgzip [options] [file] ...
 
Options: -c        write on standard output, keep original files unchanged
         -d        decompress
         -f        overwrite files without asking
         -I FILE   name of the index file [file.gz.gzi]
         -i        compress and create BGZF index
         -r        (re)index bgzipped file
         -b INT    decompress at virtual (uncompressed) file pointer INT (0 based)
         -s INT    decompress INT bytes in the uncompressed file
         -h        give this help

Bgzip is commonly used like gzip

bgzip -c plain-text.file > archive.(b)gz

extract 'gff, bed, sam, vcf, psltbl' data with tabix

Tabix allows rapid retrieval of files indexed with it from text or bgzip archive versions.

the tabix command parameters

Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r1005)
 
Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]
 
Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]
         -s INT     sequence name column [1]
         -b INT     start column [4]
         -e INT     end column; can be identical to '-b' [5]
         -S INT     skip first INT lines [0]
         -c CHAR    symbol for comment/meta lines [#]
         -r FILE    replace the header with the content of FILE [null]
         -B         region1 is a BED file (entire file will be read)
         -0         zero-based coordinate
         -h         print also the header lines
         -H         print only the header lines
         -l         list chromosome names
         -f         force to overwrite the index

Technical.png Note the presence of a parameter '-0' for 'zero-based coordinate'

extract 'custom tabular' data with tabix

For other tabular data with coordinate fields, tabix can be tuned to create an index for rapid retrieval. This can be very useful when working with large annotation files like those present in the Annovar database. The 'chromosome', 'start', and 'end' columns do not even need to appear in that order and can be designated independently as shown below.

Annovar intergenic data has format: intergenic NONE(dist=NONE),MIR3648(dist=346401) chr21 9479431 9479431 C T het 21 . with chr in column #3 start position in #4 and end position in #5

infolder="Final_Results/annovar-results"
infile="common_NA18507.variant_function"
 
outfolder="annovar-results"
mkdir -p ${outfolder}
 
# the data should be re-sorted to make sure that it is presented in chr and position order
sort -k 3V,3 -k 4n,4 -k 5n,5 ${infolder}/${infile} >\
    ${outfolder}/srt_${infile}
 
bgzip -c ${outfolder}/srt_${infile} > ${outfolder}/srt_${infile}.gz
 
## index the resulting file
# there is no header to skip or keep here, but both options exist
tabix -f -S 0 -s 3 -b 4 -e 5 ${outfolder}/srt_${infile}.gz
 
## you can now retrieve data from the compressed & indexed file using tabix
tabix ${outfolder}/srt_${infile}.gz chr21:10703925-10708925
 
## resulting in
intergenic	TEKT4P2(dist=735331),TPTE(dist=202262)	chr21	10703925	10703925	T	C	hom	222	.
intergenic	TEKT4P2(dist=735380),TPTE(dist=202213)	chr21	10703974	10703974	C	A	het	225	.
intergenic	TEKT4P2(dist=739135),TPTE(dist=198458)	chr21	10707729	10707729	T	G	het	163	.
intergenic	TEKT4P2(dist=739190),TPTE(dist=198403)	chr21	10707784	10707784	C	T	het	191	.
intergenic	TEKT4P2(dist=739351),TPTE(dist=198242)	chr21	10707945	10707945	C	T	het	18.1	.
intergenic	TEKT4P2(dist=739352),TPTE(dist=198241)	chr21	10707946	10707946	A	G	het	225	.
intergenic	TEKT4P2(dist=739577),TPTE(dist=198016)	chr21	10708171	10708171	A	C	hom	222	.

extract VCF data from the WEB (1000genome)

# as 1000g is based on B37 io hg19, we need to remove 'chr' in the address and 'hope' that coordinates are identical between builds hg19 and B37
b37_region="21:33031935-33041243"
outfolder="data_extraction"
 
mkdir -p ${outfolder}
 
cd ${outfolder} && \
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ${b37_region} |\
    bgzip -c  > SOD_1000g.vcf.gz
 
# and index the result
tabix -p vcf SOD_1000g.vcf.gz
 
# count obtained rows
zgrep -c "^21" SOD_1000g.vcf.gz
# -> 71
 
# inspect first columns of top-9 rows
zgrep -v "^##" SOD_1000g.vcf.gz | cut -f 1-6 | head
 
# return to the 'basecamp'
cd $BASE

extract from the 1000g download

#CHROM	POS	ID	REF	ALT	QUAL
21	33031974	rs7277748	A	G	.
21	33031996	.	G	C	.
21	33032035	.	T	A	.
21	33032287	rs17881180	C	T	.
21	33032857	rs114905802	C	T	.
21	33032895	.	C	G	.
21	33032896	rs17884260	T	G	.
21	33032988	rs6650814	C	A	.
21	33033001	rs4816405	C	G	.

References:
  1. http://samtools.sourceforge.net
  2. 2.0 2.1 https://github.com/samtools/hts-specs
  3. http://www.broadinstitute.org/software/igv/interpreting_pair_orientations
  4. http://picard.sourceforge.net/explain-flags.html
  5. http://en.wikipedia.org/wiki/Variant_Call_Format
  6. http://vcftools.sourceforge.net
  7. http://vcftools.sourceforge.net/VCF-poster.pdf
  8. http://sourceforge.net/p/samtools/mailman/message/27980791/
  9. http://samtools.sourceforge.net/samtools.shtml

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS Exercise.1 ]