HTSeq


HTSeq is a Python package for processing data from high-throughput sequencing assays




Besides the Python framework available for programming, HTSeq [1] also provides two standalone executables, described below. An additional online tutorial and manuals are available [2].

htseq-count: create count table

The counting application of HTSeq is used for both ChIP-Seq and RNA-Seq data. The command takes an alignment file in SAM or BAM format and a feature file in GFF format and calculates, for each feature, the number of reads mapping to it. Note: there are three different modes for handling reads that overlap more than one feature (union, intersection-strict, intersection-nonempty), illustrated in the next figure. The mode is an important option and should always be set explicitly with -m (the default is union). See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.

[Figure: count_modes.png, read assignment under the union, intersection-strict and intersection-nonempty modes]
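The three modes can be sketched in plain Python. This is a minimal illustration, not the HTSeq implementation: it assumes each aligned position of a read has already been mapped to the set of feature IDs covering it, and the function name `assign_read` and the labels `no_feature` and `ambiguous` are chosen for this sketch only.

```python
def assign_read(position_feature_sets, mode="union"):
    """Assign one read to a feature, or to a special label.

    position_feature_sets holds one set of feature IDs per aligned
    position of the read (an empty set where no feature is annotated).
    """
    sets = [set(s) for s in position_feature_sets]
    if mode == "union":
        # every feature touched anywhere along the read
        candidates = set().union(*sets)
    elif mode == "intersection-strict":
        # only features covering every position of the read
        candidates = set.intersection(*sets) if sets else set()
    elif mode == "intersection-nonempty":
        # like strict, but positions without any feature are ignored
        non_empty = [s for s in sets if s]
        candidates = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError("unknown mode: %r" % mode)

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that covers gene_A completely and gene_B only partly.
read = [{"gene_A"}, {"gene_A", "gene_B"}]
for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(read, mode))
```

In this example the read is ambiguous under union but goes to gene_A under both intersection modes, which is why the choice of mode can change the final counts.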

Usage: htseq-count [options] sam_file gff_file

command help

htseq-count
Usage: htseq-count [options] alignment_file gff_file
 
This script takes an alignment file in SAM/BAM format and a feature file in
GFF format and calculates for each feature the number of reads mapping to it.
See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.
 
Options:
  -h, --help            show this help message and exit
  -f SAMTYPE, --format=SAMTYPE
                        type of <alignment_file> data, either 'sam' or 'bam'
                        (default: sam)
  -r ORDER, --order=ORDER
                        'pos' or 'name'. Sorting order of <alignment_file>
                        (default: name). Paired-end sequencing data must be
                        sorted either by position or by read name, and the
                        sorting order must be specified. Ignored for single-
                        end data.
  -s STRANDED, --stranded=STRANDED
                        whether the data is from a strand-specific assay.
                        Specify 'yes', 'no', or 'reverse' (default: yes).
                        'reverse' means 'yes' with reversed strand
                        interpretation
  -a MINAQUAL, --minaqual=MINAQUAL
                        skip all reads with alignment quality lower than the
                        given minimum value (default: 10)
  -t FEATURETYPE, --type=FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr=IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  -m MODE, --mode=MODE  mode to handle reads overlapping more than one feature
                        (choices: union, intersection-strict, intersection-
                        nonempty; default: union)
  -o SAMOUT, --samout=SAMOUT
                        write out all SAM alignment records into an output SAM
                        file called SAMOUT, annotating each line with its
                        feature assignment (as an optional field with tag
                        'XF')
  -q, --quiet           suppress progress report
 
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.0.

reading from pipe

To read from standard input, use - as <sam_file>.
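As an illustration, the pipe can be driven from Python. The sketch below assumes that `samtools` and `htseq-count` are both installed and on the path; the helper names `build_htseq_count_cmd` and `count_from_bam` are hypothetical and exist only for this example. Only options listed in the help text above are used.

```python
import subprocess

MODES = ("union", "intersection-strict", "intersection-nonempty")


def build_htseq_count_cmd(gff_file, alignment_file="-", mode="union",
                          stranded="yes", order="name",
                          feature_type="exon", id_attr="gene_id"):
    """Build an htseq-count argument list; '-' means read SAM from stdin."""
    if mode not in MODES:
        raise ValueError("unknown mode: %r" % mode)
    return ["htseq-count",
            "-f", "sam",        # the piped data is SAM text
            "-r", order,
            "-s", stranded,
            "-t", feature_type,
            "-i", id_attr,
            "-m", mode,         # set the overlap mode explicitly
            alignment_file, gff_file]


def count_from_bam(bam_file, gff_file, **options):
    """Stream a BAM file through 'samtools view -h' into htseq-count."""
    cmd = build_htseq_count_cmd(gff_file, alignment_file="-", **options)
    samtools = subprocess.Popen(["samtools", "view", "-h", bam_file],
                                stdout=subprocess.PIPE)
    try:
        result = subprocess.run(cmd, stdin=samtools.stdout,
                                capture_output=True, text=True, check=True)
    finally:
        samtools.stdout.close()
        samtools.wait()
    return result.stdout  # the count table as text
```

A call such as `count_from_bam("sample.bam", "genes.gtf", mode="intersection-nonempty", stranded="no")` would then return the count table; the file names here are placeholders.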

htseq-qa: quality analysis

htseq-qa takes a read_file, which is either a FASTQ file or a SAM file, and produces quality plots. If htseq-qa is not in your path, you can alternatively call the script with python -m HTSeq.scripts.qa [options] read_file. For a SAM file, a plot with two columns is produced (reads are split into unaligned and aligned ones unless -n is given); for a FASTQ file, you get only one column. The output is written to a file with the same name as read_file, with the suffix .pdf added. View it with a PDF viewer such as Acrobat Reader.

Usage: htseq-qa [options] read_file
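The fallback to the module form can be sketched as follows. This is only an illustration: the helper names `qa_command` and `default_report_name` are hypothetical, and the `which` parameter exists so the lookup can be replaced when trying the function out.

```python
import shutil
import sys


def qa_command(read_file, file_type="sam", outfile=None, which=shutil.which):
    """Build an htseq-qa argument list, using the module form as fallback."""
    if which("htseq-qa"):
        cmd = ["htseq-qa"]
    else:
        # htseq-qa is not on the path: call the script through Python
        cmd = [sys.executable, "-m", "HTSeq.scripts.qa"]
    cmd += ["-t", file_type]
    if outfile is not None:
        cmd += ["-o", outfile]
    cmd.append(read_file)
    return cmd


def default_report_name(read_file):
    """Name of the PDF written when no -o option is given."""
    return read_file + ".pdf"


print(qa_command("reads.fastq", file_type="fastq"))
print(default_report_name("reads.fastq"))
```

The resulting list can be passed to `subprocess.run` to produce the report.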

command help

htseq-qa
Usage: htseq-qa [options] read_file
 
This script take a file with high-throughput sequencing reads (supported
formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a
simply quality assessment by producing plots showing the distribution of
called bases and base-call quality scores by position within the reads. The
plots are output as a PDF file.
 
Options:
  -h, --help            show this help message and exit
  -t TYPE, --type=TYPE  type of read_file (one of: sam [default], bam, solexa-
                        export, fastq, solexa-fastq)
  -o OUTFILE, --outfile=OUTFILE
                        output filename (default is <read_file>.pdf)
  -r READLEN, --readlength=READLEN
                        the maximum read length (when not specified, the
                        script guesses from the file
  -g GAMMA, --gamma=GAMMA
                        the gamma factor for the contrast adjustment of the
                        quality score plot
  -n, --nosplit         do not split reads in unaligned and aligned ones
  -m MAXQUAL, --maxqual=MAXQUAL
                        the maximum quality score that appears in the data
                        (default: 41)
 
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.0.

supported types

sam: a SAM file (note that SAMtools contains Perl scripts to convert most alignment formats to SAM)
solexa-export: an _export.txt file as produced by the Solexa Pipeline software after aligning with Eland (htseq-qa expects the new Solexa quality encoding as produced by version 1.3 or newer of the Solexa Pipeline)
fastq: a FASTQ file with standard (Sanger or Phred) quality encoding
solexa-fastq: a FASTQ file with Solexa quality encoding, as produced by the Solexa Pipeline after base-calling with Bustard (htseq-qa expects the new Solexa quality encoding as produced by version 1.3 or newer of the Solexa Pipeline)
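The difference between the fastq and solexa-fastq types comes down to how quality characters map to scores. The sketch below assumes the usual conventions, which this page does not state: an ASCII offset of 33 for standard (Sanger) FASTQ and an offset of 64 for the Solexa Pipeline 1.3+ encoding. The function `decode_qualities` is hypothetical and not part of HTSeq.

```python
# Assumed ASCII offsets per file type (not taken from the htseq-qa help).
QUALITY_OFFSETS = {"fastq": 33, "solexa-fastq": 64}


def decode_qualities(quality_string, file_type="fastq"):
    """Convert a FASTQ quality string into a list of integer scores."""
    if file_type not in QUALITY_OFFSETS:
        raise ValueError("unsupported type: %r" % file_type)
    offset = QUALITY_OFFSETS[file_type]
    scores = [ord(ch) - offset for ch in quality_string]
    if any(score < 0 for score in scores):
        # a negative score usually means the wrong type was chosen
        raise ValueError("quality below offset; wrong file type?")
    return scores


print(decode_qualities("II5", "fastq"))
print(decode_qualities("hhT", "solexa-fastq"))
```

Choosing the wrong type shifts every score by 31, so it is worth checking the encoding of a file before setting -t.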


References:
  1. http://biorxiv.org/content/biorxiv/early/2014/02/20/002824.full.pdf
  2. http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html


