Q&A added during the NGS variant analysis training


Please find here the questions raised during the session, with some elements of answer found with Dr G

[ Main_Page | NGS_data_analysis ]


How to get help for the commands used in the training?

There are four ways to get documentation / help for a command:

  • command: always start with this one; for a toolbox it will often list all available subcommands
  • man command: opens the manual page
  • command -h: prints the help text
  • command --help: if the -h option doesn't work, you can try --help
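As a sketch, with the standard ls command (any command from the training works the same way, as long as it supports these options):

```shell
# print the long help text; for GNU tools the first line is a usage summary
ls --help | head -n 3

# the other variants are interactive or command-specific:
# man ls      # opens the manual page in a pager
# ls -h       # careful: for 'ls' itself, -h means 'human-readable sizes'!
```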

How to reuse previous commands in the command line?

You can use the UP and DOWN arrow keys on the keyboard for this. The last 500 commands that you typed in the command line are stored in a file, called the History. List the History with 'history'. You navigate the History just as you navigate any text file: with the arrow keys on your keyboard.

  • The UP key goes to the previous command in the History (the previous line in the History file)
  • The DOWN key goes to the next command (the next line in the History file)
  • history lists the history on screen
  • !123 re-executes the command number 123 from the history (picked from the list you just printed)
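A minimal sketch: in a script, history recording has to be switched on explicitly (interactive shells record it by default):

```shell
# enable the history list (only needed in scripts, not in a terminal)
set -o history
echo "first command"
echo "second command"
# list the most recent entries, each prefixed with its number;
# '!<number>' would then re-execute the corresponding entry
history | tail -n 2
```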

How to navigate to the start or the end of a command?

If you have typed a long command and you want to make a change, it can be helpful to have shortcuts to go to the start or the end of the command:

  • Ctrl + A: go to the start of the command
  • Ctrl + E: go to the end of the command

How to copy and paste in the terminal?

To copy, just select the text you wish to copy with your mouse: you will see that the selected text becomes highlighted, which means it is ready to be pasted.
To paste, (move back to the terminal line if you were in another window and) either use:

  • Shift + Insert
  • click the mouse wheel

For mice without a wheel, see this article

How to use autocompletion in the terminal?

To use command-line completion, you have to

  • type the first few characters of a command, program or filename
  • press the Tab key on your keyboard to fill in the rest of the name
  • from the proposed alternatives, add more text to make your choice unique
  • press Enter to run the command or open the file

How to expand environment variables while typing commands?

In older bash versions one could use environment variables to stand in for long paths while typing commands. This is currently not supported for all commands and requires a small trick to work.

Here is an example for java commands where the path to picard tools is stored in the $PICARD variable.


$ java -Xmx4g -jar $PICARD/

then try expansion to all programs in that folder with: Tab Tab (twice)

# => all you get is
$ java -Xmx4g -jar \$PICARD/

NOW! hit: Esc followed by Ctrl+E

$PICARD is now replaced by its full path
$ java -Xmx4g -jar /home/bits/NGS_DNASeq-training/bin/

You can proceed with Tab Tab (twice) to get all possible picard commands

Tip: If there is no '\' before $PICARD yet, the single Esc + Ctrl+E key sequence is enough.

Can you create files in a folder when you are located in another folder?

Yes, you can. As long as you tell the command line where the folder that you want to use is located (by giving the path to the folder), you don't need to be in the folder in which you are creating the files.
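A small sketch (the /tmp paths are made up for the example):

```shell
# create a folder somewhere else without leaving the current directory
mkdir -p /tmp/demo_project/results
# create a file in that folder by giving its full path
touch /tmp/demo_project/results/out.txt
# list the folder's content, again via its path
ls /tmp/demo_project/results
```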

How does IF - THEN - ELSE work in the command line?

The IF command has the following syntax:

if COMMANDS1; then
  COMMANDS2
[ else
  COMMANDS3 ]
fi

When you run this command, COMMANDS1 are executed:

  • if they succeed: COMMANDS2 are executed
  • if they do not succeed: COMMANDS3 are executed or if you do not use the else phrase (it's optional as you can see by the square brackets []) nothing will happen

Please note the syntax: the positions of the semicolons and the spaces are really important; if you do not type them, the code is not executed.
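A minimal, runnable sketch (the file name is made up):

```shell
# COMMANDS1: test whether a file exists
if [ -f /tmp/if_demo.txt ]; then
    # COMMANDS2: runs when the test succeeds
    echo "file found"
else
    # COMMANDS3: runs when the test fails
    echo "file missing"
fi
```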

What can you use the zless command for?

zless opens compressed text files one screenful at a time in the terminal.
So the command is essentially the same as zcat file | less where:

  • zcat uncompresses the file (gunzips it) and passes the text on
  • less is the standard command to open text files one screen at a time

Tip: zcat is the corresponding counterpart of cat: the FULL archive content is extracted and sent to the screen.
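A quick sketch with a throw-away file (zless itself is interactive, so only the zcat half can be shown non-interactively):

```shell
# make a small compressed file for the demo
echo "hello compressed world" | gzip > /tmp/demo.txt.gz
# zcat uncompresses the whole file and sends it to the screen
zcat /tmp/demo.txt.gz
# zless /tmp/demo.txt.gz   # same content, but one screenful at a time
```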

What does the $ sign mean in the command line?

It indicates that what is following is a variable (more info on using variables in Linux command line)
e.g. you create a variable as follows: FOLDER="/usr/bits/NGS_DNASeq-training/Final_Results/Illumina_hg18_mapping"
then you can use this variable later on in commands: cd $FOLDER

Note: Often, capitalised variable names are preferred for global or environment variables that are permanently stored (by including them in the user .profile or .bashrc file) and available in all terminal windows (eg $PATH, $HOME, $BASE). By contrast, lower-case variables are preferred for temporary, local usage and are destroyed when closing the terminal session (eg: $infolder, $outfile, $nthr). This is only a convention and is not obligatory, but it makes scripts easier to read.
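The example above, as a runnable sketch with a made-up path:

```shell
# no spaces are allowed around the '=' sign
FOLDER="/tmp/variable_demo"
mkdir -p $FOLDER     # the variable is expanded wherever it is used
cd $FOLDER
pwd                  # confirms we are now in /tmp/variable_demo
```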

What does the 2> mean in the command line?

The command line descriptors for standard input, standard output, and standard error messages are 0, 1, and 2, respectively.

  • 2 simply points to the error message part of the output.
  • > is the redirection sign: it redirects the output, in this case: the error messages, to a file

e.g. htscmd bamshuf -O -n128 $infile $outfile 2>htscmd_bamshuf_infile.err means that the error messages of the htscmd bamshuf command are printed to a file called htscmd_bamshuf_infile.err
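A self-contained sketch (the paths are made up; '|| true' just keeps the failing command from stopping a script):

```shell
# 'ls' on a non-existent path writes its complaint to standard error (2);
# redirect descriptor 2 to a file so the terminal stays clean
ls /no/such/path 2> /tmp/demo.err || true
# the error message was captured in the file
cat /tmp/demo.err
```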

Why is '| less' so useful with verbose commands?

When a command generates a large output and you print it to the terminal, the output rolls over your screen at a very fast rate and you will not be able to read it. You can add | less (pipe to less) at the end of the command to print the output to the terminal one screen at a time. This way, you have the time to inspect the output at your own speed. This is obviously most useful for a first run; the | less part can be replaced by a redirection to a file the next time you run the command.

# first time, we check that we indeed get only synonymous calls
grep -w "synonymous" myvariants.vcf | less
# then for real, but losing the VCF header in the process!
# note: '-w' matches whole words; since '-' is not a word character,
# 'non-synonymous' lines would still match and may need extra filtering
grep -w "synonymous" myvariants.vcf > my-synonymous-variants.txt
# finally, getting first the full header [note the '(...)' grouping]
( grep "^#" myvariants.vcf; grep -w "synonymous" myvariants.vcf ) > my-synonymous-variants.vcf

Does gzip remove the original files?

Yes: if you use the command without options or redirection, e.g. gzip myfile.txt, the file myfile.txt will be replaced by the compressed file myfile.txt.gz and the original removed (there is no undo if this fails!).
But if you use gzip -c myfile.txt > myfile.txt.gz, the original file myfile.txt is kept and the compressed file is created in addition to the original one.
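A sketch with a throw-away file:

```shell
echo "some text" > /tmp/demo_keep.txt
# without options, 'gzip /tmp/demo_keep.txt' would REPLACE the file;
# with '-c' the compressed data goes to stdout and we redirect it,
# so the original file is kept
gzip -c /tmp/demo_keep.txt > /tmp/demo_keep.txt.gz
ls /tmp/demo_keep.txt /tmp/demo_keep.txt.gz   # both files now exist
```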

Why was the BAM file divided into 2 FASTQ files?

Because the BAM contains data from paired reads. If you want to remap the reads, you need to divide the right end reads and the left end reads into two separate files. Some mappers can handle files that contain 'interleaved' reads (first,second,first,second,...) but since we know that our BAM file contains a lot of errors with regard to read pairs, it's safer to divide them.

Why was the BAM file reshuffled?

This has been brought to our attention by Geraldine from the GATK team.
The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked. (read the full post here).

How can you find reads that belong to the same pair in the BAM file?

  • They have the same name
  • One has name_1, the other name_2
  • One has name/1, the other name/2

Why use the VALIDATION_STRINGENCY = LENIENT argument in Picard tools?

Picard is a very picky tool: when it bumps into an error, even a small one, it crashes. If you want to avoid this you can set the VALIDATION_STRINGENCY argument to LENIENT. This means the tool will turn the errors into warnings and ignore them. Two other arguments will reduce the quantity of text in the terminal, with the risk of missing some important message (java -jar $PICARD/command.jar -H for full help).

  • VERBOSITY=LogLevel | Control verbosity of logging. Default value: INFO. This option can be set to 'null' to clear the default value. Possible values: {ERROR, WARNING, INFO, DEBUG}
  • QUIET=Boolean | Whether to suppress job-summary info on System.err. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
  • VALIDATION_STRINGENCY=ValidationStringency | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to 'null' to clear the default value. Possible values: {STRICT, LENIENT, SILENT}

How much coverage do I need?

The short answer here is as much as you can afford.
Generally speaking for variant calling using data from a single genome, you need at least 20x coverage.
However, you should consider that using multiple biological replicates can also be relevant: using 3 replicates at 30x coverage is better than one sample at 100x coverage.
Additionally, the nature of the sample is also important: most samples are not pure. For instance, samples from tumor biopsies in most cases contain only 50% tumor and 50% non-tumor background tissue. Because of the background contamination, you have to double the coverage to obtain the same coverage for the tumor reads (the ones that you are interested in), so at least 40x (Complete Genomics delivers an average of 80x for tumour and 40x for normal samples, hence the quality of their data).

If your samples are very impure, is it worthwhile to sequence them?

What if your background is even higher than 50%, e.g. 80% or more?
It's hard to set a threshold on the impurity that can be tolerated: the best advice is to try one impure sample first, to see whether it is worthwhile to proceed with your other samples. What you dedicate to 'impurity' is lost for the better part that you want to study. The more you 'waste', the less sensitive you will be in the end at catching small details.

Can IGV open my NGS files?

Yes, IGV can open {bam, bed, bg, vcf, tdf} files and many more (see: IGV recommended formats).

Tip: IGV identifies formats from the file extension, so name your files accordingly ;-)

IGV displays genomic-coordinate-based data as 'tracks' aligned to the reference genome selected by the user at the top (many different reference genomes can be installed). In our case, both the tdf coverage file and the original bam file show the results of the mapping, but one is a summary of the coverage while the other (bam) also shows the individual reads.
To create a tdf coverage track, you start from a bam file, transform the bam data into a bg (bedGraph) file with (eg) Bedtools, then use IGVTools to convert the bg text file into a binary tdf file, a better supported IGV format. IGV can then load the tdf file much faster.

Tip: The benefit of tdf coverage files is that they are visible at any zoom level, while you need to zoom in quite a lot before you start seeing the reads from the bam file.

Why do we need to annotate the variants?

When obtaining human DNA-Seq data, you start from a huge vcf file containing over 3 million variants for an average human sample.
First of all you want to filter this huge file to get rid of the variants that you are not primarily interested in, e.g. you are only interested in variants in coding regions...
For this you need to know which of the variants lie in CDS and which are located in intergenic regions; in other words, you need to annotate them. For this we propose to use annovar: it will add annotations to your variants so you can select the subset that is important for your research, e.g. variants that lead to a change in 1 amino acid, variants that lead to loss of a start codon, variants that introduce a stop codon... However, annovar only accepts a specific file format as input, so first you have to transform your vcf file into an annovar file (no big deal). This transformation leads to the loss of a lot of information: only a few columns of the vcf file are retained, so you lose many quality scores etc.
After using the annovar annotation to filter the 'big data', you will hopefully be left with just a few thousand candidate causative variants. You can use this relatively small file, together with the original vcf file, to recover the corresponding VCF-formatted data. This is done using Bedtools intersect, seen in other parts of the training (see NGS_Exercise.4b).

