HTSeq
HTSeq is a Python package for processing data from high-throughput sequencing assays
Besides the Python framework available for programming, HTSeq[1] also ships two standalone executables, described below. An online tutorial and additional manuals are available [2].
htseq-count: create count table
The counting application of HTSeq is used for both ChIP-Seq and RNA-Seq applications and can report coverage in different ways, illustrated in the next figure. The command takes an alignment file in SAM or BAM format and a feature file in GFF format and calculates, for each feature, the number of reads mapping to it. Note: there are three different modes for handling reads that overlap more than one feature (union, intersection-strict, intersection-nonempty). This is an important option and should always be set explicitly. See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.
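The three overlap modes can be sketched in a few lines of Python. This is a minimal illustration of the set logic described in the htseq-count documentation, not the HTSeq implementation: the function name `assign_read`, the input layout (one set of feature IDs per aligned read position) and the labels `no_feature` and `ambiguous` are assumptions made for this example.

```python
def assign_read(position_feature_sets, mode="union"):
    """Decide which feature a read is counted for (illustrative sketch).

    position_feature_sets holds, for every aligned position of the read,
    the set of feature IDs overlapping that position.
    """
    sets = list(position_feature_sets)
    if mode == "union":
        # every feature touched anywhere along the read
        candidates = set().union(*sets)
    elif mode == "intersection-strict":
        # only features covering every position of the read
        candidates = set.intersection(*sets) if sets else set()
    elif mode == "intersection-nonempty":
        # as above, but positions without any feature are ignored
        nonempty = [s for s in sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"   # hypothetical label for this sketch
    if len(candidates) > 1:
        return "ambiguous"    # hypothetical label for this sketch
    return next(iter(candidates))


# A read lying partly inside gene_A and partly outside any feature
overhanging = [{"gene_A"}, {"gene_A"}, set()]
# A read fully inside gene_A that also touches gene_B
overlapping = [{"gene_A"}, {"gene_A", "gene_B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(overhanging, mode), assign_read(overlapping, mode))
```

The two example reads show why the mode matters: the same read can be counted, discarded as having no feature, or flagged as ambiguous depending on the choice.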
Usage: htseq-count [options] sam_file gff_file
command help
htseq-count
Usage: htseq-count [options] alignment_file gff_file

This script takes an alignment file in SAM/BAM format and a feature file in
GFF format and calculates for each feature the number of reads mapping to it.
See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.

Options:
  -h, --help            show this help message and exit
  -f SAMTYPE, --format=SAMTYPE
                        type of <alignment_file> data, either 'sam' or 'bam'
                        (default: sam)
  -r ORDER, --order=ORDER
                        'pos' or 'name'. Sorting order of <alignment_file>
                        (default: name). Paired-end sequencing data must be
                        sorted either by position or by read name, and the
                        sorting order must be specified. Ignored for
                        single-end data.
  -s STRANDED, --stranded=STRANDED
                        whether the data is from a strand-specific assay.
                        Specify 'yes', 'no', or 'reverse' (default: yes).
                        'reverse' means 'yes' with reversed strand
                        interpretation
  -a MINAQUAL, --minaqual=MINAQUAL
                        skip all reads with alignment quality lower than the
                        given minimum value (default: 10)
  -t FEATURETYPE, --type=FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr=IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  -m MODE, --mode=MODE  mode to handle reads overlapping more than one feature
                        (choices: union, intersection-strict,
                        intersection-nonempty; default: union)
  -o SAMOUT, --samout=SAMOUT
                        write out all SAM alignment records into an output SAM
                        file called SAMOUT, annotating each line with its
                        feature assignment (as an optional field with tag
                        'XF')
  -q, --quiet           suppress progress report

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.0.
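When htseq-count is driven from a script, it can help to spell out every option instead of relying on the defaults (note that --stranded defaults to yes). The helper below is a sketch: the function name `build_htseq_count_command` and its parameter names are hypothetical, and only the flags and choices are taken from the help text above.

```python
def build_htseq_count_command(alignment_file, gff_file, *, file_format="sam",
                              order="name", stranded="yes", min_quality=10,
                              feature_type="exon", id_attribute="gene_id",
                              mode="union"):
    """Return an argument list for htseq-count with all options explicit."""
    # Allowed values copied from the htseq-count help text
    choices = {
        "file_format": (file_format, ("sam", "bam")),
        "order": (order, ("pos", "name")),
        "stranded": (stranded, ("yes", "no", "reverse")),
        "mode": (mode, ("union", "intersection-strict",
                        "intersection-nonempty")),
    }
    for name, (value, allowed) in choices.items():
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")

    return [
        "htseq-count",
        f"--format={file_format}",
        f"--order={order}",
        f"--stranded={stranded}",
        f"--minaqual={min_quality}",
        f"--type={feature_type}",
        f"--idattr={id_attribute}",
        f"--mode={mode}",
        alignment_file,
        gff_file,
    ]


# Example with placeholder file names
print(" ".join(build_htseq_count_command(
    "aligned.bam", "genes.gtf", file_format="bam", order="pos",
    stranded="no", mode="intersection-nonempty")))
```

The returned list can be passed to `subprocess.run`; the file names in the example are placeholders.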
reading from pipe
To read the alignments from standard input, pass - in place of <sam_file>.
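Reading from a pipe is convenient when the alignments come from another program. The sketch below connects two processes with `subprocess`; using `samtools view -h` as the upstream producer is an assumption of this example, and the helper names `pipeline_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def pipeline_commands(bam_file, gff_file, count_options=()):
    """Build the two argument lists; '-' makes htseq-count read stdin."""
    producer = ["samtools", "view", "-h", bam_file]  # assumed upstream tool
    consumer = ["htseq-count", *count_options, "-", gff_file]
    return producer, consumer


def run_pipeline(producer, consumer):
    """Run `producer | consumer` and return the consumer's standard output."""
    upstream = subprocess.Popen(producer, stdout=subprocess.PIPE)
    downstream = subprocess.Popen(consumer, stdin=upstream.stdout,
                                  stdout=subprocess.PIPE, text=True)
    # Close our copy so the producer sees a broken pipe if the consumer exits
    upstream.stdout.close()
    output, _ = downstream.communicate()
    upstream.wait()
    if upstream.returncode != 0 or downstream.returncode != 0:
        raise RuntimeError("pipeline failed")
    return output


# Usage (requires samtools and htseq-count to be installed):
# producer, consumer = pipeline_commands("aligned.bam", "genes.gtf",
#                                        ["--stranded=no"])
# counts_text = run_pipeline(producer, consumer)
```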
htseq-qa: quality analysis
If the file htseq-qa is not in your path, you can alternatively call the script with python -m HTSeq.scripts.qa [options] read_file. The read_file is either a FASTQ file or a SAM file. For a SAM file, a plot with two columns is produced as above; for a FASTQ file, you get only one column. The output is written to a file with the same name as read_file, with the suffix .pdf added. View it with a PDF viewer such as Acrobat Reader.
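The fallback described above can be automated. The following sketch picks the htseq-qa executable when it is on the path and the module form otherwise, and derives the default report name. The helper names `build_htseq_qa_command` and `default_report_name` are hypothetical; the `--type` flag comes from the htseq-qa help text.

```python
import shutil
import sys


def build_htseq_qa_command(read_file, file_type="sam", script_on_path=None):
    """Return an argument list that runs the htseq-qa quality analysis."""
    if script_on_path is None:
        # Look the script up on the PATH unless the caller already knows
        script_on_path = shutil.which("htseq-qa") is not None
    if script_on_path:
        prefix = ["htseq-qa"]
    else:
        # Fall back to running the script as a module of the HTSeq package
        prefix = [sys.executable, "-m", "HTSeq.scripts.qa"]
    return prefix + [f"--type={file_type}", read_file]


def default_report_name(read_file):
    """Name of the PDF written when no output file is given."""
    return read_file + ".pdf"


# Example with a placeholder file name
print(build_htseq_qa_command("reads.fastq", file_type="fastq"))
print(default_report_name("reads.fastq"))
```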
Usage: htseq-qa [options] read_file
command help
htseq-qa
Usage: htseq-qa [options] read_file

This script take a file with high-throughput sequencing reads (supported
formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a
simply quality assessment by producing plots showing the distribution of
called bases and base-call quality scores by position within the reads. The
plots are output as a PDF file.

Options:
  -h, --help            show this help message and exit
  -t TYPE, --type=TYPE  type of read_file (one of: sam [default], bam,
                        solexa-export, fastq, solexa-fastq)
  -o OUTFILE, --outfile=OUTFILE
                        output filename (default is <read_file>.pdf)
  -r READLEN, --readlength=READLEN
                        the maximum read length (when not specified, the
                        script guesses from the file
  -g GAMMA, --gamma=GAMMA
                        the gamma factor for the contrast adjustment of the
                        quality score plot
  -n, --nosplit         do not split reads in unaligned and aligned ones
  -m MAXQUAL, --maxqual=MAXQUAL
                        the maximum quality score that appears in the data
                        (default: 41)

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.0.
supported types
- sam: a SAM file (note that the SAMtools contain Perl scripts to convert most alignment formats to SAM)
- solexa-export: an _export.txt file as produced by the Solexa Pipeline software after aligning with Eland (htseq-qa expects the new Solexa quality encoding as produced by version 1.3 or newer of the Solexa Pipeline)
- fastq: a FASTQ file with standard (Sanger or Phred) quality encoding
- solexa-fastq: a FASTQ file with Solexa quality encoding, as produced by the Solexa Pipeline after base-calling with Bustard (htseq-qa expects the new Solexa quality encoding as produced by version 1.3 or newer of the Solexa Pipeline)
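The difference between the fastq and solexa-fastq types comes down to how quality scores are stored as characters. The sketch below assumes the commonly used ASCII offsets, 33 for the standard (Sanger) encoding and 64 for the encoding of Solexa Pipeline 1.3 and newer; these offsets are not stated in the text above, and the names `QUALITY_OFFSETS` and `decode_qualities` are hypothetical.

```python
# Assumed ASCII offsets: 33 for standard (Sanger) FASTQ,
# 64 for the Solexa Pipeline 1.3+ encoding
QUALITY_OFFSETS = {"fastq": 33, "solexa-fastq": 64}


def decode_qualities(quality_string, file_type="fastq"):
    """Convert a FASTQ quality string into a list of integer scores."""
    offset = QUALITY_OFFSETS[file_type]
    scores = [ord(char) - offset for char in quality_string]
    if any(score < 0 for score in scores):
        # A negative score means the string uses a lower offset than assumed
        raise ValueError(f"quality string does not match type {file_type!r}")
    return scores


# The same score of 40 is written as 'I' in one encoding and 'h' in the other
print(decode_qualities("I", "fastq"))         # [40]
print(decode_qualities("h", "solexa-fastq"))  # [40]
```

Choosing the wrong type shifts every score by 31, which is why the read file type should match how the file was produced.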
References:
- [1] http://biorxiv.org/content/biorxiv/early/2014/02/20/002824.full.pdf
- [2] http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html