Create a GC content track

From BITS wiki
Jump to: navigation, search

[ Main_Page | NGS_data_analysis ]

Create a custom 'GC'% track for your reference genome


(A hg19 GC track can be loaded from the IGV server but only for a 5bps averaging window )

Aim - create a custom GC percent genome wide track

Local excess of GC or AT bases will result in altered pairing of the DNA molecule and introduce bias in analysis or experimental setup that rely on DNA hybridozation (oligo annealing, denaturation, alignement). It can be useful to identify local unbalance in base composition or display it in a genome context. The short howto creates a GC% track for th full human genome (hg19) but can be used as guide for any other genome for which you know the sequence.

Handicon.png this page relates to Create a mappability track


Download the yeast reference genome data from the UCSC table repository

We use below Kent tools (unix version from [2], also available for macOSX[3]) used by the UCSC team to produce the Genome browser system. Some of these tools should be present on your machine in order to repeat the code below.

Handicon.png If you do not have Kent tools installed, or if your reference genome is absent from the UCSC repository, You can download it in Fasta format. This means that you can apply this method to contigs or BACs provided you have their sequence - if you choose to download the fasta sequence, please proceed to the next section

# create a folder to store all results
mkdir –p ${basefolder}
# download the 2bit file
wget -P ${basefolder} 
# extract the fasta data with Kent's 'twoBitToFa'
twoBitToFa ${basefolder}/hg19.2bit ${basefolder}/hg19.fa
# inspect fasta headers
grep "^>" ${basefolder}/hg19.fa
# also create a two-column list of chromosomes names and sizes with Kent's 'faSize'
faSize ${basefolder}/hg19.fa -detailed > ${basefolder}/hg19.sizes

alternate method to get sequence sizes without Kent tools

# generate a fasta index (including sizes in second column)
samtools faidx ${basefolder}/hg19.fa
# extract sequence names (chromosomes) and their lengths to a new file
cut -f 1,2 ${basefolder}/hg19.fa.fai > ${basefolder}/hg19.sizes

create a N-base wide step file from the chromosome list with Bedtools makewindows

The scanning method chosen here uses a stepping window but you can also generate sliding windows using Bedtools [4] (read the DOC[5]).

# create a folder to store all results
mkdir –p ${basefolder}
# make 1kb wide bins
# create a BED file with windows of wished width
bedtools makewindows \
	-g ${basefolder}/hg19.sizes \
	-w ${width} \
	> ${basefolder}/hg19_${width}bps.bed

compute base frequencies across all bins with BedTools nuc

The following information will be reported after each BED entry by bedtools nuc

command help page

Tool:    bedtools nuc (aka nucBed)
Version: v2.19.1
Summary: Profiles the nucleotide content of intervals in a fasta file.
Usage:   bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-s	Profile the sequence according to strand.
	-seq	Print the extracted sequence
	-pattern	Report the number of times a user-defined sequence
			is observed (case-sensitive).
	-C	Ignore case when matching -pattern. By defaulty, case matters.
	-fullHeader	Use full fasta header.
		- By default, only the word before the first space or tab is used.
Output format: 
	The following information will be reported after each BED entry:
	    1) %AT content
	    2) %GC content
	    3) Number of As observed
	    4) Number of Cs observed
	    5) Number of Gs observed
	    6) Number of Ts observed
	    7) Number of Ns observed
	    8) Number of other bases observed
	    9) The length of the explored sequence/interval.
	    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
	    11) The number of times a user's pattern was observed.
	        (opt., if -pattern is used.)
# create a folder to store all results
mkdir –p ${basefolder}
bedtools nuc \
	-fi ${basefolder}/hg19.fa \
	-bed ${basefolder}/hg19_${width}bps.bed  \
	> ${basefolder}/hg19_nuc_${width}.txt

Cli tools.png bedtools takes care of producing the required fasta index if this one is absent

extract GC% results to a new IGV-formatted file for visualization

To obtain a valid .igv format, we need the classical 'chr', 'start', and 'end' columns together with a 'title' column and the GC % values from column #5 of the bedtools nuc output. Our best friend Awk takes care of providing all this in no time.

# width was defined earlier and was set to 1000, hence the title column filled with 'GCpc_1000bps'
gawk -v w=${width} 'BEGIN{FS="\t"; OFS="\t"}
if (FNR>1) {print $1,$2,$3,"GCpc_"w"bps",$5}
}' ${basefolder}/hg19_nuc_${width}.txt > ${basefolder}/hg19_${width}bps.igv

create a binary version with IGVtools for fast streaming

Handicon.png For a large reference genome and/or a small width, the IGV file will grow quite big. It is a good practice to use igvtools to derive a .tdf version thereof for rapid visualization and streaming in IGV

igvtools command help page

Command "toTDF"
The "toTDF" command converts a sorted data input file to  a binary tiled
data (.tdf) file. Input file formats supported  are .wig, .cn, .igv,
and .gct, TCGA mage-tab files, and "list" files.
List files are text files containing a list of files in one of the supported formats,
one file per line. When using a list file the format of the contained files must be
specified explicitly with the "fileType" parameter.  List files must end with the
extension ".list".  File paths can be absolute or relative to the directory containing
the list file.
  igvtools toTDF [options]  [inputFile] [outputFile] [genome]
Required arguments:
  inputFile    The input file (see supported formats above).
  outputFile   Binary output file.  Must end in ".tdf".
  genome       A genome id or filename. See details below. Default is hg18.
  -z, --maxZoom num       Specifies the maximum zoom level to precompute. The default
               value is 7 and is sufficient for most files. To reduce file
               size at the expense of IGV performance this value can be
  -f, --windowFunctions  list     A comma delimited list specifying window functions to use
               when reducing the data to precomputed tiles.   Allowed
               values are  min, max,  mean, median, p2, p10, p90, and p98.
               The "p" values represent percentile, so p2=2nd percentile,
  -p, --probeFile file      Specifies a "bed" file to be used to map probe identifiers
               to locations.  This option is useful when preprocessing gct
               files.  The bed file should contain 4 columns:
                  chr start end name
               where name is the probe name in the gct file.
  --fileType   Explicitly specify the file type.  This is a required parameter  for TCGA mage-tab and ".list" files.
               Possible values are mage-tab, .wig, .cn, .igv, and .gct.   Only mage-tab files downloaded from the
               TCGA data center or related sights are supported at this time.
  Conversion of ".gct" and "mage-tab" files results in the creation of an ".igv" file, which is sorted by genome
  position using the "sort" command.  For this case the following optional parameters can be specified.
  -t, --tmpDir tmpdir  Specify a temporary working directory.  For large input files
               this directory will be used to store intermediate results of
               the sort. The default is the users temp directory.
  -m, --maxRecords number  The maximum number of records to keep in memory during the
               sort.  The default value is 500000.  Increase this number
               if you receive "too many open files" errors.   Decrease it
               if you experience "out of memory" errors.
      igvtools toTDF -z 5 copyNumberFile.tdf hg18
Data file formats, with the exception of .gct files, must be sorted by
start position.  If necessary files can be sorted with the "sort" command
described below.  Attempting to preprocess an unsorted file will result
in an  error.
See, or igvtools_readme.txt, for more help

Cli tools.png The input data should be sorted but this is normally the case given how we got to this point

# the '-z' (--maxZoom) value decides from which zoom level data will become visible (default=7)
# the '-f' allowed values are:
#  min, max,  mean, median, p2, p10, p90, and p98.
#  the "p" values represent percentile, so p2=2nd percentile,
igvtools toTDF \
	-z 5 \
	-f min,max,mean \
	${basefolder}/hg19_${width}bps.igv \
	${basefolder}/hg19_${width}bps.tdf \
# replace hg19 by your-genome.fasta

Results as visualized in IGV

A screen-shot is provided to illustrate the results using IGV and the hg19 loaded genome. The chr4 telomeric region including the Dux2 / Dux4 region is selected as being a known GC rich spot. Several visualizations are presented using histogram, line or heatmap. Use the one that best fits your needs.



It is obvious that many other tools can achieve this goal, we selected tools that are quite popular, easy to install and performant as a matter of execution time. Feed back on this page is welcome and can be sent to


[ Main_Page | NGS_data_analysis ]