RSeQC

From BITS wiki

Perform various Quality Control checks on FastQ or BAM data obtained by NGS

Similar to: Qualimap




RSeQC [1] is a toolbox of Python scripts that performs a number of quality control checks and data transformations useful for RNA-seq data analysis [2].

"Deep transcriptome sequencing (RNA-seq) provides massive and valuable information about functional elements in the genome. Using RNA-seq data, people can profile gene expression change, interrogate alternative splicing events, uncover novel transcribed regions, detect aberrant transcripts (such as gene fusions) and coding variants, etc. Ideally, transcriptome sequencing should be able to directly identify and quantify all RNA species, small or large, low or high abundance. However, RNA-seq is a complicated, multistep process involving sample preparation, amplification, fragmentation, purification and sequencing. A single improper operation would result in biased or even unusable data. Therefore, it is always a good practice to check the quality of your RNA-seq data before pursuing any directions listed above. Here we developed RSeQC package to comprehensively evaluate RNA-seq datasets generated from clinical tissues or other well annotated organisms such as mouse, fly and yeast. For organisms lacking reference gene models, many modules will not work."

RSeQC accepts 3 file formats as input:

  • BED file: a tab-separated, 12-column plain-text file that represents the gene model.
  • SAM or BAM file: stores read alignments. SAM is a human-readable plain-text file, while BAM is the binary version of SAM, a compact and indexable representation of read alignments. Most modules automatically recognize both BAM and SAM files as input; some modules, such as bam2wig.py, only support BAM files.
  • Chromosome size file: a two-column plain-text file.

Example SAM and chromosome size (human hg19 assembly) files, and a shell script to download chromosome size files for other genomes, are available from the RSeQC site [2].
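To make the two plain-text input formats more concrete, here is a minimal Python sketch that parses one 12-column BED line and a two-column chromosome size file. It is not part of RSeQC: the names Bed12, parse_bed12 and parse_chrom_sizes are hypothetical, the column layout follows the standard BED12 convention (block sizes in column 11, block starts relative to the transcript start in column 12), and the example records are made up for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Bed12(NamedTuple):
    """Subset of the fields of a 12-column BED record (hypothetical helper type)."""
    chrom: str
    start: int                      # 0-based transcript start
    end: int                        # transcript end (exclusive)
    name: str
    strand: str
    exons: List[Tuple[int, int]]    # absolute (start, end) of each exon block


def parse_bed12(line: str) -> Bed12:
    """Parse one tab-separated, 12-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    start = int(fields[1])
    # Columns 11 and 12 are comma-separated lists, often with a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    exons = [(start + off, start + off + size) for off, size in zip(offsets, sizes)]
    return Bed12(fields[0], start, int(fields[2]), fields[3], fields[5], exons)


def parse_chrom_sizes(text: str) -> Dict[str, int]:
    """Parse a two-column (chromosome name, size) plain-text table."""
    sizes = {}
    for line in text.splitlines():
        if line.strip():
            name, length = line.split()[:2]
            sizes[name] = int(length)
    return sizes


# Illustrative input only, not taken from a real annotation file.
example_bed = "chr1\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t2\t500,1000,\t0,3000,"
example_sizes = "chr1\t249250621\nchrX\t155270560\n"

if __name__ == "__main__":
    print(parse_bed12(example_bed))
    print(parse_chrom_sizes(example_sizes))
```

A quick check of this kind can help catch gene model files that are not in 12-column format before passing them to an RSeQC module.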

A detailed page with command examples can be found here [3].

List of commands (in version 2.3.9) and their purpose

  • bam2fq.py: convert BAM to fastq from a pipe.
  • bam2wig.py: converts all types of RNA-seq data in BAM format into a wiggle file in a single step.
  • bam_stat.py: calculate read mapping statistics for the provided BAM or SAM file. This script determines "uniquely mapped reads" from the "NH" tag in the BAM/SAM file (please note that "NH" is an optional tag and some aligners do NOT provide it).
  • clipping_profile.py: estimate clipping profile of RNA-seq reads from BAM or SAM file.
  • divide_bam.py: equally divide BAM file (m alignments) into n parts. Each part contains roughly m/n alignments that are randomly sampled from total alignments.
  • geneBody_coverage.py: check if reads coverage is uniform and if there is any 5’/3’ bias (see: Control equal read distribution across all transcript length using RSeQC).
  • gtf2bed.py: converts Cufflinks gtf to a sorted bed.
  • infer_experiment.py: infer how the RNA-seq sequencing was configured, in particular how reads were stranded for strand-specific RNA-seq data.
  • inner_distance.py: calculate the inner distance (or insert size) between two paired RNA reads.
  • junction_annotation.py: compare detected splice junctions to the reference gene model. Each detected junction is assigned to one of 3 mutually exclusive groups (annotated, complete_novel, partial_novel).
  • junction_saturation.py: check if current sequencing depth is deep enough to perform alternative splicing analyses.
  • overlay_bigwig.py: manipulate two BigWig files.
  • normalize_bigwig.py: normalize all samples to the "same amount of total mapped reads".
  • read_distribution.py: calculate how mapped reads are distributed over genomic features (such as CDS exons, 5' UTR exons, 3' UTR exons, introns and intergenic regions).
  • read_duplication.py: determine reads duplication rate: 'Sequence based' and 'Mapping based'.
  • read_hexamer.py: calculate hexamer frequency for multiple input files (fasta or fastq).
  • read_GC.py: calculate the GC content of reads; the output has two columns: the first is GC%, the second is the read count.
  • read_NVC.py: check the nucleotide composition bias.
  • read_quality.py: calculate distributions for base qualities across reads (bar plot + heat map are generated).
  • split_bam.py: given a gene list (bed) and a BAM file, this module splits the original BAM file into 3 smaller BAM files: 1. *.in.bam: reads that are mapped to exon regions of the gene list (or reads consumed by the gene list). 2. *.ex.bam: reads that cannot be mapped to the exon regions of the original gene list. 3. *.junk.bam: QC-failed reads or unmapped reads.
  • split_paired_bam.py:
  • RPKM_count.py: calculate the raw count and RPKM values for transcripts at the exon, intron and mRNA level.
  • RPKM_saturation.py: resample a series of subsets from total RNA reads and then calculate RPKM value using each subset. By doing this we are able to check if the current sequencing depth was saturated or not (or if the RPKM values were stable or not) in terms of genes' expression estimation.
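The RPKM idea behind RPKM_count.py and RPKM_saturation.py can be sketched in a few lines of Python. This is an illustration of the general technique and not RSeQC's own code: it assumes the commonly used definition RPKM = 10^9 × C / (N × L), with C the reads assigned to a feature, N the total reads considered and L the feature length in bp. The function names, the toy read assignments and the gene lengths are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def rpkm(read_count: int, feature_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of feature per million reads (assumed standard formula)."""
    if feature_length_bp <= 0 or total_reads <= 0:
        raise ValueError("feature length and total read count must be positive")
    return read_count * 1e9 / (feature_length_bp * total_reads)


def rpkm_at_fractions(
    read_genes: List[str],
    gene_lengths: Dict[str, int],
    fractions: Sequence[float],
    seed: int = 0,
) -> Dict[float, Dict[str, float]]:
    """Resample subsets of the reads and recompute RPKM for each subset.

    read_genes holds one gene name per assigned read (toy representation).
    Stable values across fractions suggest the depth is saturated.
    """
    rng = random.Random(seed)
    result = {}
    for frac in fractions:
        k = max(1, int(len(read_genes) * frac))
        subset = rng.sample(read_genes, k)      # sampling without replacement
        counts = Counter(subset)
        result[frac] = {
            gene: rpkm(counts.get(gene, 0), length, k)
            for gene, length in gene_lengths.items()
        }
    return result


if __name__ == "__main__":
    # Toy data: 300 reads on geneA (1 kb), 700 reads on geneB (2 kb).
    reads = ["geneA"] * 300 + ["geneB"] * 700
    lengths = {"geneA": 1000, "geneB": 2000}
    for frac, values in rpkm_at_fractions(reads, lengths, [0.1, 0.5, 1.0]).items():
        print(frac, values)
```

In this toy setup the RPKM values at small fractions fluctuate around the full-data values, which is the behaviour a saturation check looks for.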

Download test datasets

# Paired-end, strand-specific (Illumina). BAM file md5sum=fbd1fb1c153e3d074524ec70e6e21fb9
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/Pairend_StrandSpecific_51mer_Human_hg19.bam
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/Pairend_StrandSpecific_51mer_Human_hg19.bam.bai
 
# Paired-end, non-strand-specific (Illumina). BAM file md5sum=ba014f6b397b8a29c456b744237a12de
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/Pairend_nonStrandSpecific_36mer_Human_hg19.bam
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/Pairend_nonStrandSpecific_36mer_Human_hg19.bam.bai
 
# Single-end, strand-specific (SOLiD). BAM file md5sum=b39951a6ba4639ca51983c2f0bf5dfce
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/SingleEnd_StrandSpecific_50mer_Human_hg19.bam
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/SingleEnd_StrandSpecific_50mer_Human_hg19.bam.bai
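Since an md5sum is listed for each BAM file, it may be worth verifying the downloads before running any analysis. The following Python sketch shows one way to do that with the standard library; the helper names md5_of_file and verify_download are hypothetical, and the file names and checksums are copied from the listing above.

```python
import hashlib
import os


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Checksums as given for the test datasets.
EXPECTED_MD5 = {
    "Pairend_StrandSpecific_51mer_Human_hg19.bam": "fbd1fb1c153e3d074524ec70e6e21fb9",
    "Pairend_nonStrandSpecific_36mer_Human_hg19.bam": "ba014f6b397b8a29c456b744237a12de",
    "SingleEnd_StrandSpecific_50mer_Human_hg19.bam": "b39951a6ba4639ca51983c2f0bf5dfce",
}

if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if not os.path.exists(name):
            print(f"{name}: not downloaded")
        else:
            status = "OK" if verify_download(name, expected) else "MISMATCH"
            print(f"{name}: {status}")
```

A mismatch usually points to an incomplete or corrupted download, in which case the file should be fetched again.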

Download gene models

Pre-formatted gene model files for use with RSeQC can be downloaded with the following wget commands:

#human (hg19/GRCh37)
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_RefSeq.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_Ensembl.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_GENCODE_v14.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_GENCODE_v12.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_UCSC_knownGene.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_Vega.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_AceView.bed.gz
 
#Mouse (mm9)
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/mm9_NCBI37_Ensembl.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/mm9_NCBI37_MGC.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/mm9_NCBI37_Refseq.bed.gz
 
#Mouse (mm10)
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/GRCm38_mm10_Ensembl.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/GRCm38_mm10_MGC.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/GRCm38_mm10_RefSeq.bed.gz
 
#Fly (D. melanogaster) (BDGP R5/dm3)
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/fly_dm3_EnsemblGene.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/fly_dm3_RefSeq.bed.gz
wget http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/fly_dm3_flyBaseGene.bed.gz
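The gene model files above are gzip-compressed. The sketch below decompresses one and counts its records, under the assumption (not stated in the text above) that plain-text BED input is wanted downstream. The function names gunzip and count_bed_records are hypothetical, and the file name in the usage part is just one of the downloads listed above.

```python
import gzip
import os
import shutil


def gunzip(src: str, dst: str) -> None:
    """Decompress a .gz file to dst without loading it fully into memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def count_bed_records(path: str) -> int:
    """Count data lines in a BED file (plain or gzip-compressed).

    Blank lines, comments and UCSC 'track'/'browser' header lines are skipped.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1
            for line in handle
            if line.strip() and not line.startswith(("#", "track", "browser"))
        )


if __name__ == "__main__":
    src = "hg19_RefSeq.bed.gz"      # one of the files downloaded above
    if os.path.exists(src):
        dst = src[: -len(".gz")]
        gunzip(src, dst)
        print(dst, count_bed_records(dst), "records")
```

Counting the records before and after decompression is a cheap sanity check that the file was unpacked completely.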

Additional rRNA references

Note from the RSeQC site: "We only provide rRNA bed files for human and mouse. These ribosome RNAs were downloaded from UCSC table browser, we provide them here to facilitate users with NO WARRANTY in completeness. We are appreciated if user can make current list more complete or provide additional gene list for other species."

wget https://sites.google.com/site/liguowangspublicsite/home/hg19_rRNA.bed
wget https://sites.google.com/site/liguowangspublicsite/home/mm10_rRNA.bed
wget https://sites.google.com/site/liguowangspublicsite/home/mm9_rRNA.bed



References:
  1. Liguo Wang, Shengqin Wang, Wei Li
    RSeQC: quality control of RNA-seq experiments.
    Bioinformatics: 2012, 28(16);2184-5
    [PubMed:22743226]

  2. http://rseqc.sourceforge.net
  3. http://rseqc.sourceforge.net/#usage-information


