Linux command line

From BITS wiki
Jump to: navigation, search
Go to parent NGS data analysis

Slides from GATK prep course

Slides from Biopythonprep course

Cheat sheet

Summer School Command Line Introduction

Last year we created Linux distributions with all tools and data installed for the RNASeq training. The idea was that scientists could easily install such a distribution in VirtualBox and run it on their Windows laptop. Although it did work, we noticed that some steps (especially the mapping) were very slow or even infeasible because the laptop's RAM is split between Windows and VirtualBox. This is why we no longer work via VirtualBox but directly onto the Linux partition so that we can use the full RAM of the laptops.

If you want we can still provide you with these Linux distributions. Send us an email to obtain the image and see the Installing the Linux distribution of the NGS training inside Windows using VirtualBox wiki page for installation instructions.

Your first command

Now the terminal hangs. We typed an incorrect command and the terminal does not know what to do.

If Ctrl-C fails, try hitting ESC. In most of the cases, this will do the trick.

The synopsis of this command is:

echo [-n] [string ...]
Things in between square brackets are optional, so it means that you can use echo without options and arguments.

When the manual page is longer than the terminal, you can scroll down the page one line at a time by pressing the down arrow key, or one page at a time by pressing the spacebar. To exit the man page, press q (for quit).

The manual explains that echo prints its argument by default to the screen and then puts the prompt on a new line. The way it does this is by appending a character called a newline (a special character that literally puts the text on a new line). Because echo is often used in programs to print out a sequence of strings not separated by newlines, there is an option to prevent the newline from being inserted.


Navigating the Linux file system

Type the following command in the terminal:

cd

cd stands for change directory and is used for navigating the Linux file system

You see that using cd without arguments leads you to your home directory, on the BITS laptops this is /home/bits.

On the BITS laptops the home directory /home/bits contains a set of folders like Desktop, Documents, Downloads...

A common pattern when using the command line is changing directories using cd and then immediately typing ls to view the contents of the directory.

Among others, the detailed list shows a date and time indicating the last time a file was modified. The number before the date is the size of the file in bytes.

/usr/local/bin corresponds to a directory in the file system (/), with bin a subdirectory of local and local a subdirectory of usr.

If you have to reuse a variable often then it can be helpful to create a name for a variable, especially when the variable is long. Suppose you want to work in a directory called Illumina_exp4_20042004_mapping_to_hg19_results. To avoid repeating this long name over and over you can create a variable for it, give it a short name and use that in your commands.

To create a new directory use the mkdir (make directory) command.

To remove a directory, use the rm (remove) command. You could use rmdir but this only works on empty folders. To remove a folder with the rm command you need to use the -r option. This stands for recursively which means it will remove the folder and its complete content.

Now navigate to the NGS folder which is located in the /home/bits/ folder.

If you want to move to a folder that's located in another location of the file system, you have to give the full path to the folder.


Manipulating files

Even without a text editor, there are ways to create a file with text using the redirect operator >

The redirect operator > takes the text output of echo and redirects its contents to a file called test1.txt

The name cat is short for “concatenate”. The command can be used to combine the contents of multiple files, but we use it here to dump the content of a single file to the screen. Cat is as a “quick-and-dirty” way to view the content of a file, less is a neater way.

The append operator >> appends the text output of echo to the file test1.txt

You don't have to type out test_partII.txt, instead you can type something like test_-Tab thereby making use of tab completion. Tab completion involves automatically completing a word if there’s only one valid match on the system. For example, if the only file starting with the letters “test_” is test_partII.txt, test_-Tab refers to test_partII.txt
Especially with longer names, tab completion can save a huge amount of typing.

Download the file called exams.txt, containing the results of the spelling and maths exams of all 10-year olds of a school, into your home folder. Use wget to download the file from http://data.bits.vib.be/pub/trainingen/NGSIntro/exams.txt

Two complementary commands for inspecting files are head and tail, which allow to view the beginning (head) and end (tail) of a file. The head command shows the first 10 lines of a file.

Open the manual of head to check out the options of head. Learn how to look at the first n lines of the file.

There are many commands to look at the full content of a file. The oldest of these programs is called more, and the more recent and powerful variant is called less. Less lets you navigate through the file in several ways, such as moving one line up or down with the arrow keys, pressing space bar to move a page down... Perhaps the most powerful aspect of less is the forward slash key /, which lets you search through the file from beginning to end.

The last three essential less commands are G to move to the end of the file and 1G to move back to the beginning. To quit less, type q (for quit).

The reason the pipe works is that the tail command, in addition to taking a filename as an argument, can take input from “standard in”, which in this case is the output of the command before the pipe. The tail program takes this input and processes it the same way it processes a file.


Running tools

Bioinformatics tools are just commands on the commands line. You use them in exactly the same way as all the commands we have run up to now, by defining options and arguments. A list of options and arguments can be found in the help file.

Installing and running sl

We have seen ls the list command and use it frequently to view the contents of a folder but because of miss-typing sometimes you would result in sl, how about getting a little fun in the terminal and not command not found. This is a general linux command, you can install it from a repository.

Try out some of the options !!

Running blastp

In the folder /home/bits/Linux/ you find a file called sprot.fasta containing a set of protein sequences. We will use this file as a database for blast. The query sequence is the following:

MLLFAPCGCNNLIVEIGQRCCRFSCKNTPCPMVHNITAKVTNRTKYPKHLKEVYDVLGGSAAWE

Blast can be done via the blast website, but you can also download the blast tool and run it locally (on your computer) via the command line. For instance if you want to blast against you own database of sequences, you have to do it locally. Blast has been installed on the bits laptops.

First you have transform your own database (the sprot.fasta file in our case) into a database that can be searched by blast using the makeblastdb command.

Now you can perform a blastp search using the blastp command. Write the results to a tabular text file with comments called output.txt


Running cutadapt

In this exercise we'll do some real NGS analysis on the SRR074262.fastq file that is stored in folder /home/bits/NGS/Intro.

This data sets contain a high number of adapter sequences. These are reads that consist solely or partly of adapter sequence. You have to remove this adapter contamination using command line tools like cutadapt. This tool is installed on the bits laptops. It is not a regular bash command (it's a python program) so it doesn't have a manual but it does have a help file.

At the top of the help file you see that the standard usage of the command is:

cutadapt -a ADAPTER -o output.fastq input.fastq

The sequence of the adapter is GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA


Running Picard

The trimmed fastq file is subsequently mapped resulting in a bam file that you can download from http://data.bits.vib.be/pub/trainingen/NGSIntro/1271011_reads.pair.1.list.accepted_hits.bam

This is a raw unsorted bam file, if we want to visualize the mapping results in IGV, we need to sort and index the file. We can do the sorting using one of the Picard tools, called SortSam. Picard can be downloaded from https://github.com/broadinstitute/picard/releases/download/2.8.2/picard-2.8.2.jar

For the tools to run properly, you must have Java 1.8 installed. To check your java version run the following command:

java -version
Running Java tools from the command line requires a special syntax: you have to start the command with java and then the name of the java tool and its options and arguments.

Java jar-files are archives of multiple java files (similar to tar archives of multiple regular files). They require an even more elaborate syntax. You have to start the command with java -jar and then the name of the jar file and its options and arguments. As you can see the picard tools come as a jar-file.

Bam files are enormous files that are hard to search through. The order of the reads in a bam file is the same as in the original fastq file. However, if you want to visualize the mapping results or if you want to calculate mapping statistics it's much more efficient to sort the reads according to genomic location. This can be achieved with the SortSam tool. Look in the picard documentation for the SortSam tool.

Bam files contain duplicate reads unless you removed them during the quality control step. MarkDuplicates locates and tags duplicate reads in a bam or sam file. Duplicate reads originate from the same fragment and were typically introduced during library construction using PCR. Duplicate reads can also result from a single cluster on the flow cell, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument.

MarkDuplicates compares sequences of reads and detects duplicates. The tool's output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field. If needed, duplicates can be removed using the REMOVE_DUPLICATE and REMOVE_SEQUENCING_DUPLICATES options. (See the Picard documentation for more details).

For visualization and easy access you can build an index to the bam file using BuildBamIndex. Look in the picard documentation for the BuildBam Index tool.

Check if the files were generated.


File compression


Writing scripts

Writing and executing bash scripts

We are going to make additions to the bash script you find below:

#this program pretends to hack sites
!Define a variable str equal to " 0  1  23  45  6 789"
clear
!Print to screen: "hacking www.bits.vib.be"
!Do nothing for 2 seconds
!Print to screen: "Server hacking module is loading"
!Do nothing for 2 seconds
!Print to screen: "Hack module is starting in 2 seconds"
!Do nothing for 1 second
!Print to screen: "1 second"
!Do nothing for 1 second
ping -c 3 www.bits.vib.be
!Do nothing for 1 second
netstat
!Do nothing for 1 second
for i in {1..1000}
do
number1=$RANDOM
let "number1 %= ${#str}"
number2=$RANDOM
let "number2 %=4"
!Print to screen without newlines and with backslash escapes: "\033[1m${str:number1:1}\033[0m"
done
!Print to screen: "453572345763425834756376534"
!Do nothing for 3 seconds
!Print to screen: "www.bits.vib.be succesfully hacked!"
!Print to screen: "PASSWORD ACCEPTED: token is 453572345763425834756376534"

Open gedit and paste the code.

Save the script as HackIt.sh

What if you want to "hack" another website ? The easiest way to do allow for this is to enable to give the url as an argument of the bash command so that's what we'll do.

Reopen the file in gedit

$1 refers to the first argument of the command. If you have two arguments you use $1 and $2 to represent them.

Writing and executing Perl scripts

We are going to create and the perl script you find below:

#This program predicts if a sequence is protein, nucleic acid or rubbish
$seq = $ARGV[0];
if ($seq =~ /[JO]/) {
  print "is not a sequence, first illegal character is $&\n";
} elsif ($seq =~ /[EFILPQZ]/) {
  print "is protein\n";
} else {
  print "is nucleic acid\n";
}

Open gedit and paste the code.

Save the script as SeqIt.pl


Writing and executing Python scripts

We are going to make additions to the python script you find below:

#This program counts the number of amino acids in a protein sequence
!Define variable mySequence equal to "SFTMHGTPVVNQVKVLTESNRISHHKILAIVGTAESNSEHPLGTAITKYCKQELDTETLGTCIDFQVVPGCGI"
!Create a set myUniqueAminoAcids out of mySequence
for aaCode in myUniqueAminoAcids:
  !Print to screen, use format to fill in the values: "Amino acid {} occurs {} times."

Open gedit and paste the code.

Save the script as CountIt.py

What if you want to "count" another protein ? The easiest way to do allow for this is to enable to give the sequence as an argument of the python command so that's what we'll do.

Reopen the file in gedit

sys.argv[1] refers to the first argument of the command. If you have two arguments you use sys.argv[1] and sys.argv[2] to represent them.

Installing and using Python tools

Installing Python-based tools is not done with apt-get, instead the comand pip is used. If pip is not yet installed, the terminal will show an error message saying that pip is currently not installed. You can install pip using apt-get.

As an example we will install Biopython, a Python library for bioinformatics. See the documentation for more details.

We will write a small python script to check if Biopython was successfully installed. In the folder /home/bits/Linux/ you find a file called sprot.fasta containing a set of protein sequences that we will use as input. Move to the folder containing the file.

We will use SeqIO module of Biopython to parse the fasta file with the protein sequences. Check out the tutorial of the module.

!Import the SeqIO module of the Bio library
!For every record in the sprot.fasta file do:
!Print the id of the seq_record to the screen
!Print the length of the sequence to the screen

Open gedit and paste the code.

Save the script as ParseIt.py in the folder that contains the input file.


Compressing and decompressing files

Some files or tools come in .zip format, how to decompress them ?

In the /usr/bin/tools folder you can find the zipped version of the FastQC tool. To unzip it, you have to use the unzip command.

The /usr/bin/ folder belongs to the root user, not to the bits user. Therefore only root is allowed to do manipulations in this folder. Switch to root using the su command or type sudo in front of your commands. The system will ask for the password: bitstraining on the BITS laptops.

After decompression use ls and cd to take a look at the content of the newly created FastQC folder. You will see the fastqc command in this folder.


Sorting files

We want to sort the file exams.txt from highest to lowest score on maths.

You can see that the sorting was not done correctly: it was done alphabetically, treating the numbers in the second column as characters, instead of numbers. This means that we are still missing an option that allows for numerical sorting.

This looks a lot better, but we still have to reverse the order since we want the scores from high to low.


The case of chromosomes and natural sorting.
'sort' will sort chromosomes as text; adding few more parameters allows to get the sort you need.

You don't want to numbers next to each other in one row, you want them in a column underneath each other. This means you want to replace the blanks by end-of-lines.

However, you do not want to print the text to the screen you want to print the text to a file. Look in the slides or the cheat sheet how to do this and try to combine the three parts of the command.

It prints the chromosome numbers as a column to the file chroms.txt

Remember to use q to leave a less page.

Not good! This is a tricky problem that always comes up when you are working with chromosome numbers e.g. when sorting bam/sam files, annotation files, vcf files...

Nice !
Now try with chr in front.


Getting files from the internet

To download data via a link on the internet you can use the wget command.
For NGS analysis you often need to download genomic sequence data from the internet. As an example we are going to download the E.coli genome sequence from the iGenomes website: ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Escherichia_coli_K_12_MG1655_NCBI_2001-10-15.tar.gz

Download this file into the folder NGS/ChIPSeq/ in your home directory.

In the same way you can download NGS data from the internet. We are not going to actually do this because NGS data sets are enormous and can take hours to download. Interrupting the download is done with Ctrl + C

This creates a new folder called Escherichia_coli_K_12_MG1655.
Go into this folder and look at the whole genome fasta sequence

Installing tools

The FastQC tool was installed by unzipping it. Most tools can be installed using the make command. There are many ways to install software on Linux:

  • via the software manager, an application with a very easy user friendly interface
  • via the apt-get command
  • software packages written in Python are installed via the pip install command
<p>These methods handle the installation and removal of software on Linux distribution in a simplified way. They fetch the software from software repositories on the internet. However, these repositories do not always contain the most up-to-date version of software packages, especially not for niche software like bioinformatics tools.

So to be on the safe side, it is recommended that you download the latest version of a tool from its website (using wget) and use make to install it. In that way, you have full control over the version of the tool that you are installing.

This is not true for pip. Pip does the difficult steps in the installation for you and accesses an up-to-date package repository, so Python programs can safely be installed using pip.

Download and install all packages in the tools folder of the /usr/bin/ folder. This is a folder owned by root so it is a good idea to switch to superuser again.


Installing TopHat

In the Introduction training we use RNA-Seq reads. Mapping RNA-Seq reads is done using the TopHat tool. So we need to install the TopHat tool. We are going to do this in the /usr/bin/NGS/ folder so we need to be superuser for this.

TopHat is downloaded as a .tar.gz file

Go into the tophat folder and type:

./tophat

If this opens the help of tophat, it means the software has been installed correctly. It does not mean that you can use the software now. Well you can but you will always have to type the commands from inside the tophat folder like we do here or provide the full path to the tophat folder. The dot slash (./) in front of the command means use the tophat that is located in this folder. It tells the command line where it can find the script (./ = the current directory = /usr/bin/tools/tophat-2.1.1.Linux_x86_64/).To avoid this we can create a symbolic link for tophat2 (see later).

Installing samtools

When you navigate to the tophat folder in /usr/bin/NGS/ you see that samtools is automatically installed when TopHat was installed:

MapRNASeq5.png

If you see the samtools help page when you type

./samtools_0.1.18

it means that samtools is indeed installed


MapRNASeq6.png


Installing tools for the ChIP-Seq training

Installing cutadapt

Cutadapt is a Python program that removes adapter sequences from NGS reads.

It has already been installed on the bits laptops but if you need to install it, use these instructions.


Quality control of NGS data

Checking the quality of the Introduction training data using FASTQC

In the /home/bits/NGS/Intro directory you can find a file called SRR074262.fastq (the file containing Arabidopsis RNA-Seq reads), that was used in the exercises on FastQC in Windows. FastQC is a tool that checks the quality of fastq files, containing NGS data.

We will now try to do the same FastQC analysis from command line in Linux. FastQC is a java-based tool that needs java to be able to run.

If you get an error then you don't have java installed. If the version listed on the first line is less than 1.5 then you will have problems running FastQC and you need to update java on your laptop.

This opens the FastQC GUI and you could load a fastq file via the GUI to get its quality report. However, you can also use fastqc as a command via the command line.

The big plus of running FastQC from command line is that command line allows you to combine and run a set of commands as a program by writing a command script.

Automating FASTQC analyses

If you have many FASTQ files to check you might prefer running FASTQC from command line so you can loop over your files and process the reports automatically.

To run via command line you can simply specify a list of files to process:

fastqc somefile.fastq someotherfile.fastq
You can specify as many files as you like. If you don't specify any files the program will open the GUI.

However, there are a few options that might be helpful to use. Since FASTQC can process FASTQ, SAM and BAM files, it is always safer to tell him upfront which format to expect.

We will generate FASTQC reports for the two FASTQ files in the /home/bits/NGS/RNASeq/ folder.

Decompression of the files results in two .fastq files that can be used as inputs generating the FASTQC reports.

The two .html files contain the FASTQC reports and can be opened in a browser.

By default, FastQC will create an HTML report with embedded graphs, but also a zip file containing individual graphs and additional data files containing the raw data from which the plots were drawn.

If you have many files you might want to use a for-loop instead of typing all file names into the command.

When you go to the /home/bits/NGS/RNASeq folder you should see the same html and zip files as in the previous exercise. The two .html files contain the FASTQC reports and can be opened in a browser.
If you want to save your reports in a folder other than the folder which contains your fastq files you can specify an alternative location by using the -o option.

When you go to the /home/bits/NGS/RNASeq/FASTQCresults folder you should see the same html and zip files as in the previous exercise. The two .html files contain the FASTQC reports and can be opened in a browser.

In this way you can process hundreds of FASTQ files automatically. You can even write a script to process the reports and create a general overview of the quality of the complete experiment.

In the Templates directory of the /usr/bin/tools/FastQC/ you will find a file called header_template.html which you can edit to change the look of the report. This file contains all information concerning the layout of the FASTQC reports like the header for the report, the CSS section... and you can alter this however you see fit.


Improving the quality of the data

In this exercise we go back to the data set of the Intro training in folder /home/bits/NGS/Intro.
Almost all NGS data sets contain a high number of contaminating adapter sequences. You can remove these adapters using command line tools like cutadapt. See installation instructions.

At the top of the help file you see that the standard usage of the command is:

cutadapt -a ADAPTER -o output.fastq input.fastq

You can find the sequence of the adapter in the FastQC report of SRR074262.fastq

Note that the default error rate means that you allow max. 10% mismatches in the alignment of adapter and read.

You can compare the results with these of the original reads on the Quality control of NGS data wiki page.

FASTQC calls a failure for this plot because it knows the file contains Illumina data and it expects the reads to have the same lengths. The software does not consider the fact that this is no longer true after trimming.


Linking files

Linking FastQC

In the previous exercise you had to specify the path of the fastqc command, otherwise the operating system was not able to find (and thus execute) the command. You can avoid having to specify the path every time you want to execute a command by creating a link to the command using the ln command.
You can soft or hard links, for what we want to achieve a soft link is fine. When you place a link to the command in /usr/local/bin you will be able to run the program from any location by just typing

fastqc

So the overall format of the command is as follows:

ln -s (soft link) path_where_fastqc_is (source path) /usr/local/bin/fastqc (destination path)

Check if you can run the fastqc command from any location now.

Linking Tophat2

If you don't create a symbolic link you have to specify the full path of the command when you want to run it, otherwise the operating system is not able to find (and thus execute) the command. You can avoid having to specify the full path every time you want to execute a command by creating a link to the command using the ln command. For creating symbolic links you need superuser powers!
You can make soft or hard links, for what we want to achieve a soft link is fine. When you place a link to the command in /usr/local/bin/ you will be able to run the program from any location by just typing its name.
So the overall format of the command is as follows:

ln -s (soft link) path_where_command_is (source path) /usr/local/bin/name (destination path)

Now type tophat2. If you see the help file, the link works.

If you mess up the link you have to remove it before you can try again using the following command:

sudo unlink /usr/local/bin/tophat2


Linking samtools

We will also do the same for samtools to use samtools from anywhere in the file system.

In many cases you will have several versions of samtools running on your laptop. That's why I don't call the tool samtools but I choose the full name including the version number.

Linking tools for the ChIP-Seq training


Mapping reads

Mapping reads of the ChIP-Seq training with Bowtie

Visualize mapping in IGV

IGV is installed on the bits laptops and can be run using the igv command.

igv
This opens the graphical user interface of the tool (similar to what we have with firefox during the class). Be patient, it might take a few minutes for the program to start.

We open the bam file that was generated by the Picard modules in IGV. The bam file contains Arabidopsis reads. This means we have to visualize them on the Arabidopsis genome. Change the genome in IGV from Human hg19 to A. thaliana (TAIR10).


IGV3.png

This should display the Arabidopsis genome in the top and the bottom view.</p>

Now it's time to load the mapped reads via File in the top menu and Load from File.

IGV4.png

Select the .bam file to open. You don't need to load the .bai file, it's suffcient that it is present in the same folder as the .bam file.

This loads the data into the center view. At this point, you can't see the reads, you have to zoom in to view them.

According to the supplemental material accompanying the paper describing this data set, AT1G02930 is highly expressed in all samples and differentially expreesed during the defense response in A. thaliana. So we will zoom in on this gene. You can do this by typing the accession number in the top toolbar and clicking Go:

IGV5.png

The reads for this gene are now visualized in the center view. You can zoom in even more using the zoom bar in the top toolbar:

IGV6.png

Zoom in until you see the nucleotides of the reference sequence.


IGV7.png

The reads are represented by grey arrows, the arrow indicating the orietation of the mapping. Hovering your mouse over a read gives additional info on the mapping. The colored nucleotides indicate mismatches between the read and the reference. Alignments that are displayed with light gray borders and white fill, have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapper as some commonly used mappers use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.

By default IGV calculates and displays the coverage track (red) for an alignment file. When IGV is zoomed to the alignment read visibility threshold (by default 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). You can view count details by hovering the mouse over a coverage bar:

IGV8.png