NGS variant analysis custom functions and code

From BITS wiki
Jump to: navigation, search

[ back to Hands-on introduction to NGS variant analysis ]


Custom functions and pieces of code assembled to make your life easier when it comes to perform repetitive tasks. Information will be given during the course to use this efficiently. custom functions are edited and added to a **.myfunctions** file created in the home folder and sourced at startup in your **.bashrc** (aka .profile ...) to be always available in bash. Type **declare -F** to see all functions available on your machine (try **listfunc** too). In order to use custom functions within bash scripts, you will need to source the .myfunction file at the beginning of your script with '. $HOME/.myfunctions'.

Custom functions can take arguments just like shell scripts and can really be complex and very useful.

Picard .dict to .genome (Bedtools)

Takes a .dict file created by Picard and makes a text chromosome name + length file required for Bedtools

function dict2genome {
# create .genome file from PICARD .dict file
echo "# creating ${1%\.dict}.genome"
grep "^@SQ" ${1} | awk '{split($2,name,":"); split($3,len,":"); print name[2]"\t"len[2]}' > ${1%\.dict}.genome
}

splitline

Split lines after a given length to better render them for inclusion in web-documents. The argument defines the length of the default value of 60 is used. This command can be piped after another command to format the output.

splitline ()
{
# split lines in pipe after n char (default=60)
# usage command | command | splitline 70
# the € char is used to temporarily convert tabs to 1-char length
# tabs are restored at the end
lim=${1:-60}
tr "\t" "€" | sed -e "s/.\{${lim}\}/&\n/g" | tr "€" "\t"
}

As always, there is a command called fold to 'Wrap input lines to fit in specified width' in the GNU coreutils (http://www.gnu.org/software/coreutils/).

fold invocation

Usage: fold [OPTION]... [FILE]...
Wrap input lines in each FILE (standard input by default), writing to
standard output.
 
Mandatory arguments to long options are mandatory for short options too.
  -b, --bytes         count bytes rather than columns
  -s, --spaces        break at spaces
  -w, --width=WIDTH   use WIDTH columns instead of 80
      --help     display this help and exit
      --version  output version information and exit
 
type "info coreutils 'fold invocation'" for more examples

numcols

Take a line (1st by default) of a table and output columns as lines with leading number

function numcols () {
head -${2:-1} ${1} | tail -1 | transpose -t | awk '{print NR,$0}'
}

addchr | remchr

A function to add / remove 'chr' in chromosome names (hg19 vs B37 builds) from **VCF** or **BED** files (provided as unique argument)

function addchr() {
# add chr in front of line starting with 0-9XYM
# should work with most BED and VCF data
if [ "$#" -lt 1 ]
then
cat | gawk 'BEGIN{FS="\t"; OFS="\t"}
        {if ($1~/^MT/){
                gsub("MT","chrM",$1);
                print $0
                } else if ($1~/^[0-9XYM]/)
                {gsub($1,"chr"$1,$1);
                print $0
                } else print $0
        }'
else gawk 'BEGIN{FS="\t"; OFS="\t"}
        {if ($1~/^MT/){
                gsub("MT","chrM",$1);
                print $0
                } else if ($1~/^[0-9XYM]/)
                {gsub($1,"chr"$1,$1);
                print $0
                } else print $0
        }' $1
fi
}


function remchr() {
# removes 'chr" from front of lines starting with chr
# should work with BED and VCF
if [ "$#" -lt 1 ]
then
cat | gawk 'BEGIN{FS="\t"; OFS="\t"}
        {if ($1~/^chrM/){
                gsub("chrM","MT",$1);
                print $0
            } else if ($1~/^chr/){
                gsub("chr","",$1);
                print $0
            } else print $0
        }'
else
gawk 'BEGIN{FS="\t"; OFS="\t"}
        {if ($1~/^chrM/){
                gsub("chrM","MT",$1);
                print $0
            } else if ($1~/^chr/){
                gsub("chr","",$1);
                print $0
            } else print $0
        }' $1
fi
}

vcf2index

This custom bash function performs a frequent operation consisting in sorting by chromosome and position, compressing, and indexing a vcf file.

function vcf2index () 
{ 
# accepts both piped content or a vcf file as input
# writes header through and sorts remaining lines
# compresses with bgzip, indexes with tabix 
# and cleans up temp file (piped version)
    if [ $# -eq 1 ]; then
        title=$(basename "$1" ".vcf");
        ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/";
        grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | \
        	bgzip -c > $1".gz" && tabix -p vcf $1".gz";
    else
        cat > .tmpf;
        outfile="sorted-data.vcf.gz";
        ( grep ^"#" .tmpf;
        grep -v ^"#" .tmpf | sort -k 1V,1 -k 2n,2 ) | \
        	bgzip -c > ${outfile} && ( rm .tmpf;
        tabix -p vcf ${outfile} );
    fi
}

Alternatively, you can type the following bash command to mimic it in terminal for each new VCF file you need to process ...

grep ^"#" file.vcf | grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 | bgzip -c > \
	file.vcf.gz && tabix -p vcf file.vcf.gz"

Handicon.png Note that the second alternative involves quite some typing.

vcf2mnp-indels.sh

This custom script splits a compressed vcf-file into MNP variants and indels as these variants are often analyzed separately.

#!/bin/bash
# split vcf file into MNP and indel fractions
# SP-BITS: added '.' to allowed one nucleotide ins or del call
# awk code from: Pierre Lindenbaum
# http://biostar.stackexchange.com/questions/7426/
# how-to-separate-snp-variants-from-indel-variants-in-the-same-vcf-file
#
# Stéphane Plaisance - VIB-BITS - May-07-2013 v1.0
 
# check parameters
if [[ $# != 1 ]]
then 
echo "Usage: ${0##*/} variants.vcf.gz";
exit 1
fi
 
pref=$(basename $1 ".vcf.gz")
 
gzip -cd $1 | gawk -v pref=${pref} '
/.*/ {
        # wave header to both files
        if ($0~/^#/) {
                print $0 > pref"-snv.vcf"; 
                print $0 > pref"-indels.vcf"; 
        # both call fields are one nucleotide long or '.'
        } else if ($0~/^[^\t]+\t[0-9]+\t[^\t]*\t[atgcATGC\.]\t[a-zA-Z\.]\t/) {
                print $0 > pref"-snv.vcf"; 
        # one+ call is longer than 1 base
        } else {
                print $0 > pref"-indels.vcf"; 
        }
}'
 
# compress results with bgzip and index with tabix
bgzip ${pref}"-snv.vcf" && tabix -p vcf ${pref}"-snv.vcf.gz"
bgzip ${pref}"-indels.vcf" && tabix -p vcf ${pref}"-indels.vcf.gz"

A couple of recent vcftools additional commands can be substituted to this script. Please follow <http://vcftools.sourceforge.net/options.html> for more vcftools syntax information. Include or exclude sites that contain an indel. For this option 'indel' means any variant that alters the length of the REF allele.

split VCF by type

# split VCF in SNVs and Indels with vcftools (recent version only!)
# extraction adds .recode to the name, the second command removes it
# by using a powerful bash string manipulation operation
 
infile=some-file.vcf.gz
 
vcftools --gzvcf ${infile} --remove-indels --out SNPs_${infile} --recode && \
	mv SNPs_${infile} SNPs_${infile/\.recode/} && \
	vcf2index SNPs_${infile}
 
vcftools --gzvcf file.vcf.gz --keep-only-indels --out INDELs_file --recode && \
	mv INDELs_${infile} INDELs_${infile/\.recode/} \
	vcf2index INDELs_${infile}


Creating colorfull VENN plots in R

A recent package allows creating venn plots from the list of counts in the distinct regions. Such lists are obtained when running vcf-compare and can easily be fed to R to plot the results.

The complete documentation for that package can be reached on CRAN <http://cran.r-project.org/web/packages/colorfulVennPlot/>

Scripts are provided to create 2D, 3D, and 4D plots on our Github repository https://github.com/BITS-VIB/venn-tools[1]. Users can run these scripts (made executable) and specify each Venn count and additional parameters to obtain grey or color plots to insert in their reports. A typical call for a 4-way plot would be:

4DVenn.R -a 1 -b 2 -c 3 -d 4 -e 5 -f 6 -G 7 -i 8 -j 9 \
   -k 10 -l 11 -m 12 -n 13 -p 14 -q 15 -f "4way-Venn"\
   -A "grA" -B "grB" -C "grC" -D "grD" \
   -o "my_4wayVenn" -u 1 -x 1

4Dvenn.R script

#!/usr/bin/RScript
 
# plots a 4D VENN from precomputed data
# usage: 4Dvenn.R -a .. to .. -u (see below)
 
# counts are expected in the order
# "0100","0010","0110",1100",
# "0011","1000","1110","0111",
# "0001","1111","1010","0101",
# "1001" 
# the order is arbitrary, from top to bottom and from left to right
# extra arguments: A-label B-label C-label D-label Title (opt)
 
# example command:
# 4DVenn.R -a 1 -b 2 -c 3 -d 4 -e 5 -f 6 -G 7 -i 8 -j 9 
#   -k 10 -l 11 -m 12 -n 13 -p 14 -q 15 
#   -A "grA" -B "grB" -C "grC" -D "grD" -t "4way-Venn" 
#   -o "my_4wayVenn" -u 1 -x 1
#
# Stephane Plaisance VIB-BITS April-11-2014 v1.0
 
# required R-packages
# once only install.packages("grid")
# once only install.packages("optparse")
# once only install.packages("colorfulVennPlot")
suppressPackageStartupMessages(library("optparse"))
suppressPackageStartupMessages(library("grid"))
suppressPackageStartupMessages(library("colorfulVennPlot"))
 
#####################################
### Handle COMMAND LINE arguments ###
#####################################
 
# parameters
option_list <- list(
  make_option(c("-a", "--a.count"), type="integer", default=0,
              help="counts for A-only [default: %default]"),
  make_option(c("-b", "--b.count"), type="integer", default=0,
              help="counts for B-only [default: %default]"),
  make_option(c("-c", "--c.count"), type="integer", default=0, 
              help="counts for C-only [default: %default]"),
  make_option(c("-d", "--d.count"), type="integer", default=0, 
              help="counts for D-only [default: %default]"),  
  make_option(c("-e", "--ab.count"), type="integer", default=0,
              help="counts for AB-intersect [default: %default]"),  
  make_option(c("-f", "--ac.count"), type="integer", default=0,
              help="counts for ACB-intersect [default: %default]"),   
  make_option(c("-G", "--ad.count"), type="integer", default=0,
              help="counts for AD-intersect (! use G and not g, due to a bug in RScript) [default: %default]"), 
  make_option(c("-j", "--bc.count"), type="integer", default=0,
              help="counts for BC-intersect [default: %default]"), 
  make_option(c("-k", "--bd.count"), type="integer", default=0,
              help="counts for BD-intersect [default: %default]"), 
  make_option(c("-l", "--cd.count"), type="integer", default=0,
              help="counts for CD-intersect [default: %default]"),  
  make_option(c("-m", "--abc.count"), type="integer", default=0,
              help="counts for ABC-intersect [default: %default]"), 
  make_option(c("-n", "--abd.count"), type="integer", default=0,
              help="counts for ABD-intersect [default: %default]"),  
  make_option(c("-p", "--acd.count"), type="integer", default=0,
              help="counts for ACD-intersect [default: %default]"), 
  make_option(c("-q", "--bcd.count"), type="integer", default=0,
              help="counts for BCD-intersect [default: %default]"),  
  make_option(c("-i", "--abcd.count"), type="integer", default=0,
              help="counts for ABCD-intersect [default: %default]"), 
  make_option(c("-A", "--a.label"), type="character", default="A", 
              help="label for A [default: %default]"),
  make_option(c("-B", "--b.label"), type="character", default="B", 
              help="label for B [default: %default]"),
  make_option(c("-C", "--c.label"), type="character", default="C", 
              help="label for C [default: %default]"),
  make_option(c("-D", "--d.label"), type="character", default="D", 
              help="label for D [default: %default]"),
  make_option(c("-t", "--title"), type="character",
              help="Graph Title [default: null]"),
  make_option(c("-x", "--format"), type="integer", default=1,
              help="file format for output 1:PNG, 2:PDF [default: %default]"),
  make_option(c("-o", "--file"), type="character", default="4Dvenn",
              help="file name for output [default: %default]"),
  make_option(c("-u", "--fill"), type="character", default="3",
              help="fill with 1:colors, 2:greys or 3:white [default: %default]")          
)
 
# PARSE OPTIONS
opt <- parse_args(OptionParser(option_list=option_list))
 
# check if arguments provided
if (length(opt) > 1) {
 
# data
y=c(opt$b.count, opt$c.count,
    opt$bc.count,
    opt$ab.count, opt$cd.count,
    opt$a.count, opt$abc.count, opt$bcd.count, opt$d.count,
    opt$abcd.count,
    opt$ac.count, opt$bd.count,
    opt$acd.count, opt$abd.count,
    opt$ad.count)
 
names(y) <- c("0100","0010",
              "0110",
              "1100","0011",
              "1000","1110","0111","0001",
              "1111",
              "1010","0101",
              "1011","1101",
              "1001")
 
# labels
labels <- c(opt$a.label, opt$b.label, opt$c.label, opt$d.label)
 
# colors (15)
ncol <- 15
my.colors <- rainbow(ncol)
my.greys <- rev(gray(0:32 / 32))[1:ncol]
my.whites <- rep("#FFFFFF", ncol)
my.fill <- ifelse( rep(opt$fill=="1", ncol), 
				my.colors, 
				(ifelse(rep(opt$fill=="2",ncol), my.greys, my.whites))
				)
 
# title
my.title <- ifelse(!is.null(opt$title), opt$title, "") 
 
# format
if (opt$format==1){
# png
filename <- paste(opt$file, ".png", sep="")
png(file = filename, bg = "transparent")
 
} else {
# pdf
filename <- paste(opt$file, ".pdf", sep="")
pdf(file = filename, bg = "white")
}
 
# plot
plot.new()
plotVenn4d(y, 
          labels,
          Colors = my.fill, 
          Title = my.title)
dev.off()
}
my_4wayVenn.png

filter unpaired reads from a - read-name sorted - BAM file (bam_re-pair.pl)

#!/usr/bin/perl -w
 
# filter unpaired reads from a - read-name sorted - BAM file
# title: bam_re-pair.pl
# version: 2016-02-18 (added support for /1 or /2 after read name)
# author: Stephane Plaisance (translated from python version by Devon Ryan)
# http://seqanswers.com/forums/showthread.php?p=118936#post118936
# usage:
# samtools view -h <name_sorted.bam> | \
#	bam_re-pair.pl | \
#	samtools view -bSo <name_sorted.filtered.bam> -
 
use warnings;
use strict;
 
# variables
my $read = "";
my $read1 = "none";
my $read2 = "none";
my $name1 = "none";
my $name2 = "none";
my ($ln,$ok,$no)=(0,0,0);
 
while (my $read = <>) {
 
# forward header lines
if ($read =~ /^@/){
	print STDOUT $read;
	next;
	}
 
# process data
$ln++;
if( $name1 eq "none" ){
	$read1 = $read;
    $name1 = (split("\t", $read1))[0];
	} else {
		$name2 = (split("\t", $read))[0];
		if( $name1 eq $name2 ){
			# is paired
			$ok++;
 
			# add index to read names if absent
			if ($name1 !~ /\/1$/){
				$read1 =~ s/$name1/$name1\/1/; 
				}
			if ($name2 !~ /\/2$/){
				$read =~ s/$name2/$name2\/2/; 
				}
 
			print STDOUT sprintf("%s%s", $read1, $read);
			$read1 = "none";
			$name1 = "none";
			} else {
				# is not paired
				$no++;
				$read1 = $read;
				$name1 = (split("\t", $read1))[0];
				}
	}
}
 
# report counts with nice alignmenment
print STDERR "\n############################\n# Results\n";
print STDERR sprintf("%-18s%10d\n", "# processed:", $ln);
print STDERR sprintf("%-18s%10d\n", "# passed-pairs:", $ok);
print STDERR sprintf("%-18s%10d\n", "# failed-reads:", $no);
exit 0;

The pieces of code below are alternative ways of getting data that were removed from the main text and are assembled here as a source of inspiration for the users. Beware that some command may seem to lead to the same result but in fact present slight functional differences.

Creating a 10% subset from a BAM file with **samtools**

#! /usr/bin/env bash
# samtools Version: 0.1.19+
data=data-folder
bamfile=mappings
result=bam_subset
mkdir -p ${result}
 
# '0.1' will keep 10% of the bam records
samtools view -s 0.1 -b \
	${data}/${bamfile}.bam \
	> ${data}/samtools-sample_${bamfile}.bam \
	2>${data}/samtools-sample.err
 
# bam data is often better sorted and indexed
samtools sort ${data}/samtools-sample_${bamfile}.bam \
	${result}/srt_samtools-sample_${bamfile} && \
	samtools index ${result}/srt_samtools-sample_${bamfile}.bam

Get chromosome length from UCSC for the human genome

# hg19 (similar to GRCh37)
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo order by chrom" > hg19.chromSizes.txt
 
# GRCh38 => hg38 and not hg20 as one could expect (yes we can!)
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg38.chromInfo order by chrom" > hg38.chromSizes.txt

[ back to Hands-on introduction to NGS variant analysis ]
  1. https://github.com/BITS-VIB/venn-tools