Create a valid gtf file for tophat

From BITS wiki
Jump to: navigation, search


[ Main_Page | NGS_data_analysis ]


Create a '.gtf' annotation file from the UCSC table under CLI

Introduction

A GTF ('gene transfer format') annotation file is required with tophat (cufflinks) when mapping NGS reads to a reference genome and finding soplicing events in teh obtained data. This tabular file contains lines representing transcts with coordinate for exon boundaries and additional information including names. Although several public database sites provide gdf files for download, they are not all equal and some will induce issues iwhen used with tophat or other software.

More information about the GTF and related GFF formats can be found in other pages of this Wiki.

One way to create your own gtf file is to access the UCSC table database and download gdf data from web interface. This is however not a good solution as the web interface will save data using twice gene_IDs instead of alternativeky gene, transcript, and exon IDs.

The command-line access to the same database using a tool from Jim Kents suite allows a much better result as initially found on this page

create the GTF file

Here is the code provided by Assaf Gordon and slightly adapted. The required executable file is available on the UCSC server ([1]). (original post by A.G.)[2]

# 1. download a program called "genePredToGtf" from here:
# http://hgdownload.cse.ucsc.edu/admin/exe/
# place the correct version of the executable somewhere in your path
 
# 2. Create the following file in your home directory:
echo 'db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password' > ~/.hg.conf
 
# the file's permissions must be user-only
chmod 0600 ~/.hg.conf
 
# 3. run "genePredToGtf" with any organism and any table that is in "genePred" format:
## mm9/RefSeq Genes
genePredToGtf mm9 refGene refGene.gtf
 
## hg19/Ensemble Genes
genePredToGtf hg19 ensGene ensGene.gtf
 
## hg18/UCSC Known Genes
genePredToGtf hg18 knownGene knownGene.gtf
 
# This will save "refGene.gtf" with all the required attributes 
# (gene_id, gene_name transcript_id, exon_number).
# but still not directly usable with DESeq because of multiple isoforms per gene.

create the GFF version and extract transcript sequences

To obtain the GFF version of the GTF data, use the 'gffread' utility (part of the Cufflinks package: [3]).
Use 'gffread' once more to extract the corresponding transcripts from your genome fasta file.

# create a GFF version of hg19_ensGene.gtf 
gffread -E hg19_ensGene.gtf -o hg19_ensGene.gff
 
# extract transcript sequences from the full genome based on the GTF information
gffread -w hg19_refGene.fa -g hg19.fa hg19_refGene.gtf
 
# read the 'gffread' help page for more

help page

# options used above are flagged with **
gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>] 
 [-o <outfile.gff>] [-t <tname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]]
 [-CTVNJMKQAFGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
 [-i <maxintron>] 
 Filters and/or converts GFF3/GTF2 records.
 <input_gff> is a GFF file, use '-' if the GFF records will be given at stdin
 
 Options:
  -g  full path to a multi-fasta file with the genomic sequences
      for all input mappings, OR a directory with single-fasta files
      (one per genomic sequence, with file names matching sequence names)
  -s  <seq_info.fsize> is a tab-delimited file providing this info
      for each of the mapped sequences:
      <seq-name> <seq-length> <seq-description>
      (useful for -A option with mRNA/EST/protein mappings)
  -i  discard transcripts having an intron larger than <maxintron>
  -r  only show transcripts overlapping coordinate range <start>..<end>
      (on chromosome/contig <chr>, strand <strand> if provided)
  -R  for -r option, discard all transcripts that are not fully 
      contained within the given range
  -U  discard single-exon transcripts
  -C  coding only: discard mRNAs that have no CDS feature
  -F  full GFF attribute preservation (all attributes are shown)
  -G  only parse additional exon attributes from the first exon
      and move them to the mRNA level (useful for GTF input)
  -A  use the description field from <seq_info.fsize> and add it
      as the value for a 'descr' attribute to the GFF record
 
  -O  process also non-transcript GFF records (by default non-transcript
      records are ignored)
  -V  discard any mRNAs with CDS having in-frame stop codons
  -H  for -V option, check and adjust the starting CDS phase
      if the original phase leads to a translation with an 
      in-frame stop codon
  -B  for -V option, single-exon transcripts are also checked on the
      opposite strand
  -N  discard multi-exon mRNAs that have any intron with a non-canonical
      splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
  -J  discard any mRNAs that either lack initial START codon
      or the terminal STOP codon, or have an in-frame stop codon
      (only print mRNAs with a fulll, valid CDS)
 
  -M/--merge : cluster the input transcripts into loci, collapsing matching
       transcripts (those with the same exact introns and fully contained)
  -d <dupinfo> : for -M option, write collapsing info to file <dupinfo>
  --cluster-only: same as --merge but without collapsing matching transcripts
  -K  for -M option: also collapse shorter, fully contained transcripts
      with fewer introns than the container
  -Q  for -M option, remove the containment restriction:
      (multi-exon transcripts will be collapsed if just their introns match,
      while single-exon transcripts can partially overlap (80%))
 
  **-E  expose (warn about) duplicate transcript IDs and other potential 
      problems with the given GFF/GTF records
  -Z  merge close exons into a single exon (for intron size<4)
  **-w  write a fasta file with spliced exons for each GFF transcript
  -x  write a fasta file with spliced CDS for each GFF transcript
  -W  for -w and -x options, also write for each fasta record the exon
      coordinates projected onto the spliced sequence
  -y  write a protein fasta file with the translation of CDS for each record
  -L  Ensembl GTF to GFF3 conversion (implies -F; should be used with -m)
  -m  <chr_replace> is a reference (genomic) sequence replacement table with
      this format:
      <original_ref_ID> <new_ref_ID>
      GFF records on reference sequences that are not found among the
      <original_ref_ID> entries in this file will be filtered out
  -o  the "filtered" GFF records will be written to <outfile.gff>
      (use -o- for printing to stdout)
  -t  use <trackname> in the second column of each GFF output line
  -T  -o option will output GTF format instead of GFF3

References:
  1. http://hgdownload.cse.ucsc.edu/admin/exe/
  2. http://grokbase.com/p/r/bioconductor/12450g7q93/bioc-recommended-gene-model-for-deseq
  3. http://cufflinks.cbcb.umd.edu

[ Main_Page | NGS_data_analysis ]