Linux command line
Go to parent NGS data analysis
Slides from Biopythonprep course
Summer School Command Line Introduction
Last year we created Linux distributions with all tools and data installed for the RNASeq training. The idea was that scientists could easily install such a distribution in VirtualBox and run it on their Windows laptop. Although it did work, we noticed that some steps (especially the mapping) were very slow or even infeasible because the laptop's RAM is split between Windows and VirtualBox. This is why we no longer work via VirtualBox but directly onto the Linux partition so that we can use the full RAM of the laptops.
If you want we can still provide you with these Linux distributions. Send us an email to obtain the image and see the Installing the Linux distribution of the NGS training inside Windows using VirtualBox wiki page for installation instructions.
Contents
- 1 Your first command
- 2 Navigating the Linux file system
- 3 Manipulating files
- 4 Running tools
- 5 File compression
- 6 Writing scripts
- 7 Compressing and decompressing files
- 8 Sorting files
- 9 Getting files from the internet
- 10 Installing tools
- 11 Quality control of NGS data
- 12 Improving the quality of the data
- 13 Linking files
- 14 Mapping reads
- 15 Mapping reads with Bowtie
Your first command
Print the word hello to the screen |
---|
echo hello |
Print the sentence “hello my friends to the screen (with open quotation and without end quotation) |
---|
Remember you can use the up arrow to go back to previously typed commands
echo "hello my friends |
Now the terminal hangs. We typed an incorrect command and the terminal does not know what to do.
What to type when the terminal hangs ? |
---|
Ctrl-C |
If Ctrl-C fails, try hitting ESC. In most of the cases, this will do the trick.
Open the manual of the echo command ? |
---|
man echo |
The synopsis of this command is:
echo [-n] [string ...]
When the manual page is longer than the terminal, you can scroll down the page one line at a time by pressing the down arrow key, or one page at a time by pressing the spacebar. To exit the man page, press q (for quit).
The manual explains that echo prints its argument by default to the screen and then puts the prompt on a new line. The way it does this is by appending a character called a newline (a special character that literally puts the text on a new line). Because echo is often used in programs to print out a sequence of strings not separated by newlines, there is an option to prevent the newline from being inserted.
By reading the man page, find the command to print hello without a newline, and verify that it works as expected. |
---|
Again remember to use the up arrow to go to previously typed commands.
echo -n hello |
Open the manual of the sleep command, find how to make the terminal “sleep” for 5 seconds, and execute the command. |
---|
man sleep According to the manual sleep has a required argument called number representing the number of seconds to sleep. sleep 5 |
Make the terminal sleep for 5000 seconds and rescue the terminal. |
---|
sleep 5000 That’s more than an hour so use Ctrl-C to break off the command. |
Type the following command in the terminal:
cd
cd stands for change directory and is used for navigating the Linux file system
Which directory are you in ? |
---|
To view the name of the current working directory, type
pwd pwd stands for print working directory. |
Which directories are located in your home directory ? |
---|
To view a list of the files and directories that are located in the current working directory, type
ls ls stands for list and is used for listing all files and directories in the current directory. |
On the BITS laptops the home directory /home/bits contains a set of folders like Desktop, Documents, Downloads...
List all files and directories in your home directory that start with the letter D |
---|
ls D* D* means everything which name starts with a D |
A common pattern when using the command line is changing directories using cd and then immediately typing ls to view the contents of the directory.
List the detailed content of your home directory ? |
---|
ls -l the l in -l stands for long output. |
List the content of the /usr/local/bin directory ? |
---|
ls /usr/local/bin |
If you have to reuse a variable often then it can be helpful to create a name for a variable, especially when the variable is long. Suppose you want to work in a directory called Illumina_exp4_20042004_mapping_to_hg19_results. To avoid repeating this long name over and over you can create a variable for it, give it a short name and use that in your commands.
Name the variable folder |
---|
Use the following command: folder=Illumina_exp4_20042004_mapping_to_hg19_results |
To create a new directory use the mkdir (make directory) command.
Create the folder using the newly created variable |
---|
If you want to refer to a named variable in a command you have to preceed the name by a $ sign to indicate that what is following is a reference to a variable. mkdir ${folder} The curly braces delineate the start and end of the variable name.
|
To remove a directory, use the rm (remove) command. You could use rmdir but this only works on empty folders. To remove a folder with the rm command you need to use the -r option. This stands for recursively which means it will remove the folder and its complete content.
Remove the Illumina_exp4_20042004_mapping_to_hg19_results directory. |
---|
Use the variable as an argument of the rm command: rm -r ${folder} Check if it's removed using the ls command. |
Now navigate to the NGS folder which is located in the /home/bits/ folder.
Navigate to this location. |
---|
Since you want to navigate, you need to use the cd command. Since the NGS folder is located in the folder that you are currently in, you can simply give the name of the folder (NGS) as an argument:cd NGS |
If you want to move to a folder that's located in another location of the file system, you have to give the full path to the folder.
Go to the /usr/bin folder |
---|
cd /usr/bin |
Go back to your home folder |
---|
cd |
Manipulating files
Even without a text editor, there are ways to create a file with text using the redirect operator >
Create a file called test1.txt containing the text "Why do bioinformaticians work on the command line?" using echo |
---|
echo "Why do bioinformaticians work on the command line?" > test1.txt |
Check if it worked by viewing the content of the file on the screen |
---|
cat test1.txt |
Add the line "Because they don't want to scare you with huge amounts of data!" to the file and check if it worked |
---|
To add lines of text to a file, use the append operator >>:
echo "Because they don't want to scare you with huge amounts of data!" >> test1.txt cat test1.txt |
Create an empty file called test2.txt and check if it exists |
---|
To create an empty file, use the touch command:
touch test2.txt ls |
List the names of all text files in your current directory |
---|
ls *.txt Here *.txt automatically expands to all filenames that match the pattern “any string followed by .txt”. |
Rename the file test2.txt to test_partII.txt using mv and check if it worked |
---|
To rename a file use the mv command, short for move:
mv test2.txt test_partII.txt ls *.txt |
Copy the file test_partII.txt to test2.txt and check if it worked |
---|
To copy a file use the cp command, short for copy:
cp test_partII.txt test2.txt ls *.txt |
You don't have to type out test_partII.txt, instead you can type something like test_-Tab thereby making use of tab completion. Tab completion involves automatically completing a word if there’s only one valid match on the system. For example, if the only file starting with the letters “test_” is test_partII.txt, test_-Tab refers to test_partII.txt
Especially with longer names, tab completion can save a huge amount of typing.
Remove the file test_partII.txt and check if it worked |
---|
To remove a file use the rm command, short for remove:
rm test_partII.txt ls *.txt |
Download the file called exams.txt, containing the results of the spelling and maths exams of all 10-year olds of a school, into your home folder. Use wget to download the file from http://data.bits.vib.be/pub/trainingen/NGSIntro/exams.txt
Download the file. |
---|
wget http://data.bits.vib.be/pub/trainingen/NGSIntro/exams.txt |
Show the first 10 lines of the file. |
---|
head exams.txt |
Two complementary commands for inspecting files are head and tail, which allow to view the beginning (head) and end (tail) of a file. The head command shows the first 10 lines of a file.
Show the last 10 lines of the file. |
---|
Similarly, tail shows the last 10 lines of a file.
tail exams.txt |
Open the manual of head to check out the options of head. Learn how to look at the first n lines of the file.
Save the first 30 lines of exams.txt in a file called test.txt |
---|
head -n 30 exams.txt > test.txt |
Look at test.txt using the less command |
---|
less test.txt |
There are many commands to look at the full content of a file. The oldest of these programs is called more, and the more recent and powerful variant is called less. Less lets you navigate through the file in several ways, such as moving one line up or down with the arrow keys, pressing space bar to move a page down... Perhaps the most powerful aspect of less is the forward slash key /, which lets you search through the file from beginning to end.
Search for Jasper in test.txt |
---|
The way to do this in less is to type /Jasper |
The last three essential less commands are G to move to the end of the file and 1G to move back to the beginning. To quit less, type q (for quit).
Look at the last 10 lines of the first 20 lines of exams.txt |
---|
head -n 20 exams.txt | tail The command runs head -n 20 exams.txt and then pipes the result through tail using the pipe symbol | |
The reason the pipe works is that the tail command, in addition to taking a filename as an argument, can take input from “standard in”, which in this case is the output of the command before the pipe. The tail program takes this input and processes it the same way it processes a file.
Running tools
Bioinformatics tools are just commands on the commands line. You use them in exactly the same way as all the commands we have run up to now, by defining options and arguments. A list of options and arguments can be found in the help file.
Installing and running sl
We have seen ls the list command and use it frequently to view the contents of a folder but because of miss-typing sometimes you would result in sl, how about getting a little fun in the terminal and not command not found. This is a general linux command, you can install it from a repository.
Install sl |
---|
For installing you need superuser privileges !
sudo apt-get install sl |
Find out in the manual what sl stands for |
---|
man sl You can find the solution in the description section of the manual. |
run the command |
---|
sl
|
Try out some of the options !!
Running blastp
In the folder /home/bits/Linux/ you find a file called sprot.fasta containing a set of protein sequences. We will use this file as a database for blast. The query sequence is the following:
MLLFAPCGCNNLIVEIGQRCCRFSCKNTPCPMVHNITAKVTNRTKYPKHLKEVYDVLGGSAAWE
Create a fasta file containing the query sequence using echo called seq.fasta |
---|
echo ">query seq" > seq.fasta cat seq.fasta echo MLLFAPCGCNNLIVEIGQRCCRFSCKNTPCPMVHNITAKVTNRTKYPKHLKEVYDVLGGSAAWE >> seq.fasta cat seq.fasta |
Blast can be done via the blast website, but you can also download the blast tool and run it locally (on your computer) via the command line. For instance if you want to blast against you own database of sequences, you have to do it locally. Blast has been installed on the bits laptops.
First you have transform your own database (the sprot.fasta file in our case) into a database that can be searched by blast using the makeblastdb command.
Look at the help file of makeblastdb and find the options to define the input fasta file and the database type |
---|
makeblastdb -help You have to define the input fasta file using the -in option and the type of sequences using the -dbtype option |
Create the blast database |
---|
makeblastdb -in sprot.fasta -dbtype prot |
Now you can perform a blastp search using the blastp command. Write the results to a tabular text file with comments called output.txt
Look at the help file of blastp and find the options to define input, database, output and output format |
---|
blastp -help You need the -query, the -db, the -out and the -outfmt option |
Perform the blast and open the results with less |
---|
blastp -query seq.fasta -db sprot.fasta -out output.txt -outfmt 7 less output.txt |
Running cutadapt
In this exercise we'll do some real NGS analysis on the SRR074262.fastq file that is stored in folder /home/bits/NGS/Intro.
Go to this folder and look at the 10 first lines of the file. |
---|
cd /home/bits/NGS/Intro head SRR0474262.fastq |
This data sets contain a high number of adapter sequences. These are reads that consist solely or partly of adapter sequence. You have to remove this adapter contamination using command line tools like cutadapt. This tool is installed on the bits laptops. It is not a regular bash command (it's a python program) so it doesn't have a manual but it does have a help file.
Check the help file of cutadapt for the option to define the adapter sequence and trim at the 3'ends of the reads. |
---|
To open the cutadapt help file type:
cutadapt -h The -a option trims adapter sequences at the 3' end of the reads. |
At the top of the help file you see that the standard usage of the command is:
cutadapt -a ADAPTER -o output.fastq input.fastq
The sequence of the adapter is GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA
Trim the adapter sequence and store the trimmed sequences in a file called SRR074262trim.fastq |
---|
Go to the folder where the input file is located and type:
cutadapt -a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA -o SRR074262trim.fastq SRR074262.fastq |
Look at the first lines of the trimmed file |
---|
Go to the folder where the input file is located and type:
head SRR0474262trim.fastq |
Running Picard
The trimmed fastq file is subsequently mapped resulting in a bam file that you can download from http://data.bits.vib.be/pub/trainingen/NGSIntro/1271011_reads.pair.1.list.accepted_hits.bam
Download the file via the command line |
---|
wget http://data.bits.vib.be/pub/trainingen/NGSIntro/1271011_reads.pair.1.list.accepted_hits.bam |
Rename the file SRR074262.bam |
---|
Remember to use tab autocompletion !
mv 1271011_reads.pair.1.list.accepted_hits.bam SRR074262.bam |
This is a raw unsorted bam file, if we want to visualize the mapping results in IGV, we need to sort and index the file. We can do the sorting using one of the Picard tools, called SortSam. Picard can be downloaded from https://github.com/broadinstitute/picard/releases/download/2.8.2/picard-2.8.2.jar
Download the file |
---|
Remember to use tab autocompletion !
wget https://github.com/broadinstitute/picard/releases/download/2.8.2/picard-2.8.2.jar ll |
For the tools to run properly, you must have Java 1.8 installed. To check your java version run the following command:
java -version
Java jar-files are archives of multiple java files (similar to tar archives of multiple regular files). They require an even more elaborate syntax. You have to start the command with java -jar and then the name of the jar file and its options and arguments. As you can see the picard tools come as a jar-file.
Test the installation by opening the help file |
---|
java -jar picard-2.8.2.jar -h |
Bam files are enormous files that are hard to search through. The order of the reads in a bam file is the same as in the original fastq file. However, if you want to visualize the mapping results or if you want to calculate mapping statistics it's much more efficient to sort the reads according to genomic location. This can be achieved with the SortSam tool. Look in the picard documentation for the SortSam tool.
Sort the bam file to SRR074262sorted.bam |
---|
Remember to use tab autocompletion !
java -jar picard-2.8.2.jar SortSam \ I=SRR074262.bam \ O=SRR074262sorted.bam \ SORT_ORDER=coordinate |
Bam files contain duplicate reads unless you removed them during the quality control step. MarkDuplicates locates and tags duplicate reads in a bam or sam file. Duplicate reads originate from the same fragment and were typically introduced during library construction using PCR. Duplicate reads can also result from a single cluster on the flow cell, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument.
MarkDuplicates compares sequences of reads and detects duplicates. The tool's output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field. If needed, duplicates can be removed using the REMOVE_DUPLICATE and REMOVE_SEQUENCING_DUPLICATES options. (See the Picard documentation for more details).
Remove duplicates from the sorted bam file |
---|
Remember to use tab autocompletion !
java -jar picard.jar MarkDuplicates \ I=SRR074262sorted.bam \ O=SRR074262sortednodup.bam \ M=marked_dup_metrics.txt \ REMOVE_DUPLICATES=true |
For visualization and easy access you can build an index to the bam file using BuildBamIndex. Look in the picard documentation for the BuildBam Index tool.
Build the bai file for SRR074262sortednodup.bam |
---|
Remember to use tab autocompletion !
java -jar picard-2.8.2.jar BuildBamIndex \ I=SRR074262sortednodup.bam |
Check if the files were generated.
File compression
Compress the SRR074262.bam file to .gz format |
---|
Remember to use tab autocompletion !
gzip SRR074262.bam
ll |
and unzip it again |
---|
Remember to use tab autocompletion !
gunzip SRR074262.bam.gz
ll |
Writing scripts
Writing and executing bash scripts
We are going to make additions to the bash script you find below:
#this program pretends to hack sites !Define a variable str equal to " 0 1 23 45 6 789" clear !Print to screen: "hacking www.bits.vib.be" !Do nothing for 2 seconds !Print to screen: "Server hacking module is loading" !Do nothing for 2 seconds !Print to screen: "Hack module is starting in 2 seconds" !Do nothing for 1 second !Print to screen: "1 second" !Do nothing for 1 second ping -c 3 www.bits.vib.be !Do nothing for 1 second netstat !Do nothing for 1 second for i in {1..1000} do number1=$RANDOM let "number1 %= ${#str}" number2=$RANDOM let "number2 %=4" !Print to screen without newlines and with backslash escapes: "\033[1m${str:number1:1}\033[0m" done !Print to screen: "453572345763425834756376534" !Do nothing for 3 seconds !Print to screen: "www.bits.vib.be succesfully hacked!" !Print to screen: "PASSWORD ACCEPTED: token is 453572345763425834756376534"
Open gedit and paste the code.
Replace all lines that start with ! by the appropriate command |
---|
#this program pretends to hack sites str=" 0 1 23 45 6 789" clear echo "hacking www.bits.vib.be" sleep 2 echo "Server hacking module is loading" sleep 2 echo "Hack module is starting in 2 seconds" sleep 1 echo "1 second" sleep 1 ping -c 3 www.bits.vib.be sleep 2 netstat sleep 1 for i in {1..1000} do number1=$RANDOM let "number1 %= ${#str}" number2=$RANDOM let "number2 %=4" echo -n -e "\033[1m${str:number1:1}\033[0m" done echo "453572345763425834756376534" sleep 3 echo "www.bits.vib.be succesfully hacked!" echo "PASSWORD ACCEPTED: token is 453572345763425834756376534" |
Add a shebang line to the top of the script |
---|
#!/usr/bin/env bash #this program pretends to hack sites str=" 0 1 23 45 6 789" clear ... |
Save the script as HackIt.sh
If necessary make executable |
---|
chmod 755 HackIt.sh |
Run the script |
---|
bash HackIt.sh |
What if you want to "hack" another website ? The easiest way to do allow for this is to enable to give the url as an argument of the bash command so that's what we'll do.
Reopen the file in gedit
Replace www.bits.vib.be by $1 |
---|
#!/usr/bin/env bash #this program pretends to hack sites str=" 0 1 23 45 6 789" clear echo "hacking $1" sleep 2 echo "Server hacking module is loading" sleep 2 echo "Hack module is starting in 2 seconds" sleep 1 echo "1 second" sleep 1 ping -c 3 $1 sleep 2 netstat sleep 1 for i in {1..1000} do number1=$RANDOM let "number1 %= ${#str}" number2=$RANDOM let "number2 %=4" echo -n -e "\033[1m${str:number1:1}\033[0m" done echo "453572345763425834756376534" sleep 3 echo "$1 succesfully hacked!" echo "PASSWORD ACCEPTED: token is 453572345763425834756376534" |
Save and run the script again now giving www.kuleuven.be as an argument |
---|
bash HackIt.sh www.kuleuven.be |
$1 refers to the first argument of the command. If you have two arguments you use $1 and $2 to represent them.
Writing and executing Perl scripts
We are going to create and the perl script you find below:
#This program predicts if a sequence is protein, nucleic acid or rubbish $seq = $ARGV[0]; if ($seq =~ /[JO]/) { print "is not a sequence, first illegal character is $&\n"; } elsif ($seq =~ /[EFILPQZ]/) { print "is protein\n"; } else { print "is nucleic acid\n"; }
Open gedit and paste the code.
Add a shebang line to the top of the script |
---|
#!/usr/bin/env perl #This program predicts if a sequence is protein, nucleic acid or rubbish $seq = $ARGV[0]; if ($seq =~ /[JO]/) { ... |
Save the script as SeqIt.pl
If necessary make executable |
---|
chmod 755 SeqIt.pl |
Run the script using your first name in capitals as an argument |
---|
perl SeqIt.pl JANICK |
Writing and executing Python scripts
We are going to make additions to the python script you find below:
#This program counts the number of amino acids in a protein sequence !Define variable mySequence equal to "SFTMHGTPVVNQVKVLTESNRISHHKILAIVGTAESNSEHPLGTAITKYCKQELDTETLGTCIDFQVVPGCGI" !Create a set myUniqueAminoAcids out of mySequence for aaCode in myUniqueAminoAcids: !Print to screen, use format to fill in the values: "Amino acid {} occurs {} times."
Open gedit and paste the code.
Replace all lines that start with ! by the appropriate command |
---|
#This program counts the number of amino acids in a protein sequence mySequence = "SFTMHGTPVVNQVKVLTESNRISHHKILAIVGTAESNSEHPLGTAITKYCKQELDTETLGTCIDFQVVPGCGI" myUniqueAminoAcids = set(mySequence) for aaCode in myUniqueAminoAcids: print("Amino acid {} occurs {} times.".format(aaCode,mySequence.count(aaCode))) |
Add a shebang line to the top of the script |
---|
#!/usr/bin/env python #This program counts the number of amino acids in a protein sequence mySequence = "SFTMHGTPVVNQVKVLTESNRISHHKILAIVGTAESNSEHPLGTAITKYCKQELDTETLGTCIDFQVVPGCGI" myUniqueAminoAcids = set(mySequence) ... |
Save the script as CountIt.py
If necessary make executable |
---|
chmod 755 CountIt.py |
Run the script |
---|
python CountIt.py |
What if you want to "count" another protein ? The easiest way to do allow for this is to enable to give the sequence as an argument of the python command so that's what we'll do.
Reopen the file in gedit
Adjust the code to read the first argument of the python command using the sys library |
---|
!#/usr/bin/env python #This program counts the number of amino acids in a protein sequence import sys mySequence = sys.argv[1] myUniqueAminoAcids = set(mySequence) for aaCode in myUniqueAminoAcids: print("Amino acid {} occurs {} times.".format(aaCode,mySequence.count(aaCode))) |
Save and run the script again now giving QWEERTIPSDFFFGHKKKKLLLLLLLLLLLLLL as an argument |
---|
python CountIt.py QWEERTIPSDFFFGHKKKKLLLLLLLLLLLLLL |
sys.argv[1] refers to the first argument of the command. If you have two arguments you use sys.argv[1] and sys.argv[2] to represent them.
Installing and using Python tools
Installing Python-based tools is not done with apt-get, instead the comand pip is used. If pip is not yet installed, the terminal will show an error message saying that pip is currently not installed. You can install pip using apt-get.
As an example we will install Biopython, a Python library for bioinformatics. See the documentation for more details.
Install biopython |
---|
You need superuser privileges for this sudo pip install biopython |
We will write a small python script to check if Biopython was successfully installed. In the folder /home/bits/Linux/ you find a file called sprot.fasta containing a set of protein sequences that we will use as input. Move to the folder containing the file.
We will use SeqIO module of Biopython to parse the fasta file with the protein sequences. Check out the tutorial of the module.
!Import the SeqIO module of the Bio library !For every record in the sprot.fasta file do: !Print the id of the seq_record to the screen !Print the length of the sequence to the screen
Open gedit and paste the code.
Replace all lines that start with ! by the appropriate command |
---|
from Bio import SeqIO for seq_record in SeqIO.parse("sprot.fasta","fasta"): print(seq_record.id) print(len(seq_record)) |
Add a shebang line to the top of the script |
---|
#!/usr/bin/env python from Bio import SeqIO for seq_record in SeqIO.parse("sprot.fasta","fasta"): ... |
Save the script as ParseIt.py in the folder that contains the input file.
If necessary make executable |
---|
chmod 755 ParseIt.py |
Run the script |
---|
python ParseIt.py |
Compressing and decompressing files
Some files or tools come in .zip format, how to decompress them ?
In the /usr/bin/tools folder you can find the zipped version of the FastQC tool. To unzip it, you have to use the unzip command.
The /usr/bin/ folder belongs to the root user, not to the bits user. Therefore only root is allowed to do manipulations in this folder. Switch to root using the su command or type sudo in front of your commands. The system will ask for the password: bitstraining on the BITS laptops.
Decompress the FastQC tool with unzip. |
---|
First look at the unzip manual to get an idea about the working of the command.
man unzip unzip name_of_the_zip_file This will generate a folder called FastQC in /usr/bin/tools. |
After decompression use ls and cd to take a look at the content of the newly created FastQC folder. You will see the fastqc command in this folder.
Make sure that you can read, write and execute the fastqc command and that other people can read and execute it. |
---|
To see the current permissions of the command:
ls -l The command that allows you to change the access permissions of files and directories is chmod (change mode). chmod has two mandatory arguments:
As you can see root is the owner of the file. This is why you need to log on as superuser (= root) to be able to change root's files. |
Sorting files
We want to sort the file exams.txt from highest to lowest score on maths.
Sort the file based on score on maths. Write results to a file called examssort1.txt |
---|
You have to sort the lines in the file according to the maths score. So you want to sort the file based on the numbers in the second column: it means that you cannot use the default sort command (this will sort the lines based on the content of the first column) but you have to use an option that allows you to specify the column you wish to sort on. When you look in the manual you see that you can use the -k option for this: sort -k2 exams.txt This will sort the file according to the values in the second column, but it will overwrite the original file. To save the sorted list in a new file, examssort1.txt, use the redirect operator: > sort -k2 exams.txt > examssort1.txt |
Use the head command to look at the sorted file. |
---|
head examssort1.txt |
You can see that the sorting was not done correctly: it was done alphabetically, treating the numbers in the second column as characters, instead of numbers. This means that we are still missing an option that allows for numerical sorting.
Sort the file numerically based on score on maths. |
---|
sort -k2 -n exams.txt > examssort1.txt head examssort1.txt |
This looks a lot better, but we still have to reverse the order since we want the scores from high to low.
Sort the file numerically from highest to lowest score on maths. |
---|
For this we need to add a third option to the sort command. When you look in the manual you see that you can use the -r option for this: sort -k2 -n -r exams.txt > examssort1.txt head examssort1.txt |
Show the results of the 10 students with the highest scores on the maths exam using a single line of commands. |
---|
This means that you have to combine the head command and the sort command from the previous exercise into one single command. Remember that you can combine commands by writing them in the order they have to be performed, so in our case first sort then head, separated by the pipe operator: | sort -k2 -n -r exams.txt | head |
Show only the names of the 10 students with the highest scores on the maths exam using a single line of commands. |
---|
To leave out gender and scores you have to use the cut command. To specify which columns to cut you can use the -f option. Please note that the -f option specifies the column(s) that you want to retain ! As an argument you have to specify the name of the file you want to cut. sort -k2 -n -r exams.txt | head | cut -f3 |
The case of chromosomes and natural sorting.
'sort' will sort chromosomes as text; adding few more parameters allows to get the sort you need.
Write a list of human chromosomes (values: 22 to 1 X Y MT) to the screen. Use {end..begin} to define a numerical range. |
---|
Remember that you can use echo to print text to the screen, so to generate text. Try echo {22..1} X Y MT and see what happens... |
You don't want to numbers next to each other in one row, you want them in a column underneath each other. This means you want to replace the blanks by end-of-lines.
Replace blanks by end-of-lines. Use the sed command for this. |
---|
Look up the command for replacing text in the slides. Blanks are represented by \ (back slash followed by a blank) and end-of-lines are represented by \n (back slash followed by n). To replace all blanks by an end-of-line you need to add the g option (see sed tutorial for more info). So sed "s/\ /\n/g" should do the replacement. Of course you need to combine the two commands using the output of echo as input in sed. Look in the slides or the cheat sheet how to do this. |
However, you do not want to print the text to the screen you want to print the text to a file. Look in the slides or the cheat sheet how to do this and try to combine the three parts of the command.
Write chromosomes as a column to a file called chroms.txt |
---|
The correct solution is: echo {22..1} X Y MT | sed "s/\ /\n/g" > chroms.txt The s in the sed argument refers to substitution: you want to substitute blanks by end-of-lines, it is followed by the character you want to replace (a blank or "\ "), then the character you want to replace it with (an end-of-line or "\n"), then you add g to use sed recursively, in other words to do the substitution more than once so each time a blank is encountered. |
It prints the chromosome numbers as a column to the file chroms.txt
Look at the file using the less command. |
---|
less chroms.txt |
Remember to use q to leave a less page.
Sort the chromosome file by using a simple sort. Write results to chromssort.txt |
---|
sort chroms.txt > chromssort.txt head chromssort.txt |
Not good! This is a tricky problem that always comes up when you are working with chromosome numbers e.g. when sorting bam/sam files, annotation files, vcf files...
Modify the sort command so that the sorting of the chromosomes is done in the correct way. |
---|
Most people solve it by specifying that you want sort to do natural sorting using the -V option: sort -V chroms.txt > chromssort.txt head chromssort.txt |
Nice !
Now try with chr in front.
Create a file with values chr22 to chr1 chrX chrY chrMT into one column called chroms2.txt in one single command |
---|
echo chr{22..1} chrX chrY chrMT | sed "s/\ /\n/g" > chroms2.txt head chroms2.txt |
Sort the file into a new file called chromssort2.txt |
---|
sort -V chroms2.txt > chromssort2.txthead chroms2.txt |
Getting files from the internet
To download data via a link on the internet you can use the wget command.
For NGS analysis you often need to download genomic sequence data from the internet. As an example we are going to download the E.coli genome sequence from the iGenomes website: ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Escherichia_coli_K_12_MG1655_NCBI_2001-10-15.tar.gz
Download this file into the folder NGS/ChIPSeq/ in your home directory.
Download the data into this folder. |
---|
Go to this folder and use the wget command to download the data: cd /home/bits/NGS?ChIPSeq/ wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Escherichia_coli_K_12_MG1655_NCBI_2001-10-15.tar.gz ll |
In the same way you can download NGS data from the internet. We are not going to actually do this because
NGS data sets are enormous and can take hours to download. Interrupting the download is done with ⎈ Ctrl + C
Decompress the file. |
---|
tar -xzvf Escherichia_coli_K_12_MG1655_NCBI_2001-10-15.tar.gz ll |
This creates a new folder called Escherichia_coli_K_12_MG1655.
Go into this folder and look at the whole genome fasta sequence
Look at the fasta sequence. |
---|
Use cd to navigate the folders and head to look at the file
cd Escherichia_coli_K_12_MG1655 ll cd NCBI ll cd 2001-10-15 ll cd Sequence ll cd WholeGenomeFasta ll head genome.fa |
Installing tools
The FastQC tool was installed by unzipping it. Most tools can be installed using the make command. There are many ways to install software on Linux:
- via the software manager, an application with a very easy user friendly interface
- via the apt-get command
- software packages written in Python are installed via the pip install command
So to be on the safe side, it is recommended that you download the latest version of a tool from its website (using wget) and use make to install it. In that way, you have full control over the version of the tool that you are installing.
This is not true for pip. Pip does the difficult steps in the installation for you and accesses an up-to-date package repository, so Python programs can safely be installed using pip.
Download and install all packages in the tools folder of the /usr/bin/ folder. This is a folder owned by root so it is a good idea to switch to superuser again.
Installing TopHat
In the Introduction training we use RNA-Seq reads. Mapping RNA-Seq reads is done using the TopHat tool. So we need to install the TopHat tool. We are going to do this in the /usr/bin/NGS/ folder so we need to be superuser for this.
Go to the TopHat website and fetch the download link. |
---|
|
Download the file into the /usr/bin/NGS/ folder. |
---|
|
TopHat is downloaded as a .tar.gz file
Decompress the file |
---|
For decompressing a .tar.gz file you need the following command:
tar -xzvf tophat-2.1.1.Linux_x86_64.tar.gz Remember to use tab autocompletion ! This creates a new folder called tophat-2.1... |
Go into the tophat folder and type:
./tophat
If this opens the help of tophat, it means the software has been installed correctly. It does not mean that you can use the software now. Well you can but you will always have to type the commands from inside the tophat folder like we do here or provide the full path to the tophat folder. The dot slash (./) in front of the command means use the tophat that is located in this folder. It tells the command line where it can find the script (./ = the current directory = /usr/bin/tools/tophat-2.1.1.Linux_x86_64/).To avoid this we can create a symbolic link for tophat2 (see later).
Installing samtools
When you navigate to the tophat folder in /usr/bin/NGS/ you see that samtools is automatically installed when TopHat was installed:
If you see the samtools help page when you type
./samtools_0.1.18
it means that samtools is indeed installed
Installing tools for the ChIP-Seq training
Installing cutadapt
Cutadapt is a Python program that removes adapter sequences from NGS reads.
It has already been installed on the bits laptops but if you need to install it, use these instructions.
Quality control of NGS data
Checking the quality of the Introduction training data using FASTQC
In the /home/bits/NGS/Intro directory you can find a file called SRR074262.fastq (the file containing Arabidopsis RNA-Seq reads), that was used in the exercises on FastQC in Windows. FastQC is a tool that checks the quality of fastq files, containing NGS data.
We will now try to do the same FastQC analysis from command line in Linux. FastQC is a java-based tool that needs java to be able to run.
Check if the correct version of java is installed |
---|
In command line you can check if java is installed on your laptop using the following command:
java -version You should see something like: ava version "1.8.0_101" Java(TM) SE Runtime Environment (build 1.8.0_101-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode) |
If you get an error then you don't have java installed. If the version listed on the first line is less than 1.5 then you will have problems running FastQC and you need to update java on your laptop.
Run FastQC |
---|
To run FastQC as a GUI just like in Windows type:
fastqc |
This opens the FastQC GUI and you could load a fastq file via the GUI to get its quality report. However, you can also use fastqc as a command via the command line.
Open the file SRR074262.fastq to obtain the sequence of the contaminating adapter. |
---|
Check the exercise on FastQC in Windows for details on the quality report that is generated |
The big plus of running FastQC from command line is that command line allows you to combine and run a set of commands as a program by writing a command script.
Automating FASTQC analyses
If you have many FASTQ files to check you might prefer running FASTQC from command line so you can loop over your files and process the reports automatically.
View the help files of the fastqc command |
---|
As for most commands the -h option nicely opens the help file:
fastqc -h
</p> |
To run via command line you can simply specify a list of files to process:
fastqc somefile.fastq someotherfile.fastq
However, there are a few options that might be helpful to use. Since FASTQC can process FASTQ, SAM and BAM files, it is always safer to tell him upfront which format to expect.
We will generate FASTQC reports for the two FASTQ files in the /home/bits/NGS/RNASeq/ folder.
Decompress the files |
---|
First you have to decompress the fastq files. In the cheat sheet look up the command for decompressing a .gz file
gunzip chr22_SRR1039509_1.fastq.gz gunzip chr22_SRR1039509_2.fastq.gz |
Decompression of the files results in two .fastq files that can be used as inputs generating the FASTQC reports.
Generate the FASTQC reports for the two fastq files. |
---|
As you can see in the help file of fastqc, the -f option allows you to specify the format of the input file(s).
fastqc -f fastq chr22_SRR1039509_1.fastq chr22_SRR1039509_2.fastq |
The two .html files contain the FASTQC reports and can be opened in a browser.
Open the first report in firefox via command line |
---|
firefox chr22_SRR1039509_1_fastqc.html |
By default, FastQC will create an HTML report with embedded graphs, but also a zip file containing individual graphs and additional data files containing the raw data from which the plots were drawn.
Remove the .html and the .zip files |
---|
rm *.zip rm *.html |
If you have many files you might want to use a for-loop instead of typing all file names into the command.
Write a for-loop to process the two FASTQ files. |
---|
First go back to the folder that contains the fastqc command and make sure you are operating as superuser. Take a close look at the syntax of the for-loop that is described in the slides. We are going to use the syntax for looping over files in a folder. Don' t forget the * to loop over all fastq files in the specified folder: for file in /home/bits/NGS/RNASeq/*.fastq do fastqc -f fastq ${file} done |
When you go to the /home/bits/NGS/RNASeq folder you should see the same html and zip files as in the previous exercise. The two .html files contain the FASTQC reports and can be opened in a browser.
If you want to save your reports in a folder other than the folder which contains your fastq files you can specify an alternative location by using the -o option.
Create a new folder called FASTQCresults |
---|
mkdir FASTQCresults |
Create a variable output. Its value is the path to the newly created folder. |
---|
output=/home/bits/NGS/RNASeq/FASTQCresults/ |
Write a for-loop to analyze the quality of the fastq files and write the report to the new folder |
---|
Adjust the code of the for-loop to write the results to the newly created folder for file in /home/bits/NGS/RNASeq/*.fastq do fastqc -f fastq -o ${output} ${file} done Don't forget the $ since output and file are variables. Write every line on a different line in the terminal. |
When you go to the /home/bits/NGS/RNASeq/FASTQCresults folder you should see the same html and zip files as in the previous exercise. The two .html files contain the FASTQC reports and can be opened in a browser.
In this way you can process hundreds of FASTQ files automatically. You can even write a script to process the reports and create a general overview of the quality of the complete experiment.
In the Templates directory of the /usr/bin/tools/FastQC/ you will find a file called header_template.html which you can edit to change the look of the report. This file contains all information concerning the layout of the FASTQC reports like the header for the report, the CSS section... and you can alter this however you see fit.
Improving the quality of the data
In this exercise we go back to the data set of the Intro training in folder /home/bits/NGS/Intro.
Almost all NGS data sets contain a high number of contaminating adapter sequences. You can remove these adapters using command line tools like cutadapt. See installation instructions.
Check the help file for the option that defines the number of mismatches you allow (= error rate). |
---|
To open the cutadapt help files (it's not a regular bash command so it doesn't have a manual) type:
cutadapt -h Scrolling down the help file shows that the -e option defines the maximum allowed error rate: the default is 0.1 meaning that it allows one mismatch every 10 nucleotides. Adapter sequences are identified by aligning each read to the adapter sequence: if the frequency of mismatches in the alignment is below the allowed error rate then the adapter sequence is trimmed from the read. |
Check the option you need for defining the adapter sequence |
---|
In the help file you see that you have multiple options:
Since we probably have contaminating adapter at the 3' end we'll take the -a option |
At the top of the help file you see that the standard usage of the command is:
cutadapt -a ADAPTER -o output.fastq input.fastq
You can find the sequence of the adapter in the FastQC report of SRR074262.fastq
Trim the adapter sequence using the default error rate, store the trimmed sequences in a file SRR074262trim.fastq |
---|
So in our case the command is:
cutadapt -a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA -o SRR074262trim.fastq SRR074262.fastq |
Note that the default error rate means that you allow max. 10% mismatches in the alignment of adapter and read.
How many reads consisted solely of adapter sequence (and were consequently completely removed) ? |
---|
The output of the cutadapt command is:
This is cutadapt 1.8.1 with Python 2.7.6 Command line parameters: -a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA -o SRR074262trim.fastq SRR074262.fastq Trimming 1 adapter with at most 10.0% errors in single-end mode ... Finished in 66.92 s (7 us/read; 8.62 M reads/minute). === Summary === Total reads processed: 9,619,406 Reads with adapters: 2,327,902 (24.2%) Reads written (passing filters): 9,619,406 (100.0%) Total basepairs processed: 346,298,616 bp Total written (filtered): 271,141,022 bp (78.3%) === Adapter 1 === Sequence: GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA; Type: regular 3'; Length: 36; Trimmed: 2327902 times. No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-36 bp: 3 Bases preceding removed adapters: A: 6.1% C: 1.5% G: 1.8% T: 3.0% none/other: 87.5% Overview of removed sequences length count expect max.err error counts 3 156030 150303.2 0 156030 4 48693 37575.8 0 48693 5 12005 9394.0 0 12005 6 8702 2348.5 0 8702 7 6686 587.1 0 6686 8 5546 146.8 0 5546 9 5958 36.7 0 5484 474 10 5479 9.2 1 4539 940 11 4197 2.3 1 3737 460 12 4038 0.6 1 3713 325 13 3392 0.1 1 3158 234 14 2730 0.0 1 2531 199 15 2801 0.0 1 2625 176 16 2384 0.0 1 2221 163 17 1887 0.0 1 1759 128 18 1998 0.0 1 1848 150 19 1572 0.0 1 1447 123 2 20 1257 0.0 2 1079 107 71 21 1141 0.0 2 1029 90 22 22 730 0.0 2 671 46 13 23 504 0.0 2 471 21 12 24 549 0.0 2 499 37 13 25 495 0.0 2 441 39 15 26 587 0.0 2 538 35 14 27 657 0.0 2 585 53 19 28 711 0.0 2 633 40 26 12 29 764 0.0 2 687 49 24 4 30 889 0.0 3 760 85 33 11 31 887 0.0 3 739 94 42 12 32 579 0.0 3 466 65 37 11 33 438 0.0 3 347 36 38 17 34 700 0.0 3 541 85 53 21 35 5390 0.0 3 4652 507 171 60 36 2037526 0.0 3 1870684 129754 20094 16994 In the last line you see the number of reads with 36 bases aligned to the adapter sequence. Since that is the total of the read (the reads are 36bp long) it means that over 2 million reads only consist of adapter sequence, 1.870.684 being completely identical to the adapter, 129.754 containing 1 mismatch with the adapter... |
Open the trimmed sequences in FastQC |
---|
To open the FastQC GUI type the fastqc command
fastqc |
You can compare the results with these of the original reads on the Quality control of NGS data wiki page.
Are all the reads still 36 nt long after trimming ? |
---|
In the Basic statistics tab you see that the length of the reads varies as was to be expected after trimming
|
Have the quality scores of the reads significantly changed after trimming ? |
---|
The Per base sequence quality is similar to that of the untrimmed file, as is the Per sequence quality. The latter one just shows a lower number of sequences since the 2 million reads that consisted solely of adapter sequence are no longer taken into account.
Quality scores have changed a bit of course since you removed bases and reads from the data set but you did not trim based on quality but based on similarity to an adapter sequence so the scores of the trimmed reads are similar to those of the untrimmed reads. If you had trimmed low quality bases, the quality scores would have been higher in the trimmed reads. |
Has the per base sequence content improved as a result of the trimming ? |
---|
The Per base sequence content - the tool to detect adapter contamination - plot has greatly improved allthough it is still not considered stable enough.
|
What are the bumps you see in the Sequence length distribution plot ? |
---|
This question is related to the results of the trimming: Overview of removed sequences length count expect max.err error counts 3 156030 150303.2 0 156030 4 48693 37575.8 0 48693 5 12005 9394.0 0 12005 ... 33 438 0.0 3 347 36 38 17 34 700 0.0 3 541 85 53 21 35 5390 0.0 3 4652 507 171 60 36 2037526 0.0 3 1870684 129754 20094 16994 As you can see here over 2 million reads corresponded to adapter over their entire length and as a result were trimmed to length zero. This is the large peak at length zero on the plot. Over 150000 reads contain 3 bases that belong to the adapter. These 3 bases have been cut leaving reads of 33 nt long: this is the small peak you see on the plot at length 33. All intermediate lengths of adapter contamination have been detected but in such a small fraction of reads that you cannot see the influence of the trimming on the plot.
|
FASTQC calls a failure for this plot because it knows the file contains Illumina data and it expects the reads to have the same lengths. The software does not consider the fact that this is no longer true after trimming.
Are there any overrepresented sequences left ? |
---|
The 2 million sequences that were initially detected as contaminating adapters are still in the list but now as sequences with zero length. The other contaminating sequences are of course still present but at very low counts.
|
Are there any overrepresented hexamers ? |
---|
FASTQC still detects overrepresented hexamers although at much lower counts than before. These are probably parts of the remaining overrepresented sequences.
|
Linking files
Linking FastQC
In the previous exercise you had to specify the path of the fastqc command, otherwise the operating system was not able to find (and thus execute) the command. You can avoid having to specify the path every time you want to execute a command by creating a link to the command using the ln command.
You can soft or hard links, for what we want to achieve a soft link is fine. When you place a link to the command in /usr/local/bin you will be able to run the program from any location by just typing
fastqc
So the overall format of the command is as follows:
ln -s (soft link) path_where_fastqc_is (source path) /usr/local/bin/fastqc (destination path)
What's the command you would need for creating this soft link ? |
---|
When you look in the manual of ln you see that for creating a soft link you need the -s option. So you use the following command:
ln -s /usr/bin/tools/FastQC/fastqc /usr/local/bin/fastqc |
Check if you can run the fastqc command from any location now.
Linking Tophat2
If you don't create a symbolic link you have to specify the full path of the command when you want to run it, otherwise the operating system is not able to find (and thus execute) the command. You can avoid having to specify the full path every time you want to execute a command by creating a link to the command using the ln command. For creating symbolic links you need superuser powers!
You can make soft or hard links, for what we want to achieve a soft link is fine. When you place a link to the command in /usr/local/bin/ you will be able to run the program from any location by just typing its name.
So the overall format of the command is as follows:
ln -s (soft link) path_where_command_is (source path) /usr/local/bin/name (destination path)
Create a symbolic link for tophat2 |
---|
For creating the link you need the following command:
sudo ln -s /usr/bin/NGS/tophat-2.1.1.Linux_x86_64/tophat2 /usr/local/bin/tophat2 Remember to use tab autocompletion ! |
Now type tophat2. If you see the help file, the link works.
If you mess up the link you have to remove it before you can try again using the following command:
sudo unlink /usr/local/bin/tophat2
Linking samtools
We will also do the same for samtools to use samtools from anywhere in the file system.
Create a symbolic link for samtools |
---|
Create a link using the ln -s command:
sudo ln -s /usr/bin/NGS/tophat-2.1.1.Linux_x86_64/samtools_0.1.18 /usr/local/bin/samtools-0.1.18 Check if the command works. If you type samtools-0.1.18 view (one of the possible samtools commands) you should see the manual of the command. |
In many cases you will have several versions of samtools running on your laptop. That's why I don't call the tool samtools but I choose the full name including the version number.
Linking tools for the ChIP-Seq training
Mapping reads
Mapping reads of the ChIP-Seq training with Bowtie
Mapping reads with Bowtie
Exercise created by Morgane Thomas Chollier
Obtaining the reference genome
If you are going to follow the ChIP-Seq training, skip this part: you are going to do these steps during the ChIP-Seq training. The fasta file containing the reference genome is called Escherichia_coli_K12.fasta and is stored in the /home/bits/NGS/ChIPSeq/ folder on the BITS laptops. Alternatively you can use the file that you downloaded via wget in exercise 3.
If you are not going to follow the ChIP-Seq training, go on and see how to obtain the reference genome.
Back to the ChIP-Seq data of E. coli. In this experiment we want to see which genomic regions are bound to transcription factor FNR. However, at this point what we have is a set of reads that are identified by their location of the flow cell. To answer our question we should link the reads to regions in the genome.
To obtain their genomic coordinates, we will map each read on the reference genome sequence
As said before, for Illumina reads the standard mappers are BWA and Bowtie (version 1 and 2). In this exercise we will use Bowtie version1.
Check out the installation instructions for Bowtie.
Bowtie1 was installed and a symbolic link was created so the command should work from anywhere in the file system when you type bowtie-1.1.2
What happens when you type the bowtie command ? |
---|
This prints the help of the program. However, the help file is a bit difficult to read ! If you need to know more about the program, it's easier to directly check the manual on the website |
Bowtie needs a reference sequence to align each read on it.
Which E. coli strain was used in the experiment ? |
---|
Go to the paper and check the part Strains and growth conditions in the Materials and methods section. There you see that the experiment was done using E. coli K-12 MG1655.
|
So we need the genome sequence of E. coli K-12 MG1655 and it needs to be in a specific format (=index) for bowtie to be able to use it. Several pre-built indexes are available to download on the bowtie webpages or the iGenomes website.
Although the E. coli sequence is available we will not use it to show you how you should proceed if you don't find your reference sequence here. In that case you will need to make the index file yourself.
If you can't find your reference on the iGenomes website you have to download it from:
Since Ensembl focuses on higher eukaryotes, we are going to download the genome from NCBI.
Which reference sequence was used in the experiment ? |
---|
Go to the paper and check the part High-throughput RNA sequencing (RNA-seq) analysis. There you see that the reads were mapped to an NCBI sequence with version number U00096.2.
|
Search for this sequence on NCBI ? |
---|
Go to the NCBI website, select the Nucleotide database, type U00096.2 as a search term and click Search. In the record of this sequence you see that an updated version is available. Click the See current version link.
|
This sequence is not a RefSeq sequence (the high quality part of NCBI Genbank). You can see that because the version number does not contain an underscore and all RefSeq version numbers contain an underscore.
Is there a RefSeq sequence available ? |
---|
In the record of the current version, scroll down to the Related information section in the right menu. There you see that a RefSeq sequence is available. Click the Identical RefSeq link.
|
This brings us to a RefSeq record with version number NC_000913.3. Note that we will not take this lastest version but the previous one (NC_000913.2), because the available tools for visualization have not been updated yet to the latest version. This will not affect our results.
Download the sequence of the previous version of the RefSeq record in FASTA format |
---|
Search the Nucleotide database for NC_000913.2
|
This creates a file called sequence.fasta in the Downloads folder in your Home folder. Copy the downloaded file to the folder where the fastq files are located (/home/bits/NGS/ChIPSeq on the BITS laptops) and rename it as Escherichia_coli_K12.fasta.
Writing a bash script to map the reads to the reference genome
Suppose that you expect to be doing many NGS experiments on E. coli. Each time we analyze a data set, we will have to map the reads against the E. coli genome. The best way to ensure that you can reuse commands during the next analysis, is to combine them into a script (= small program). Since the script will consist of command line (= bash) commands, the script is called a bash script.
You cannot do the mapping directly on the .fasta file, you need to index the file first. Reference genomes from the Bowtie /iGenomes website are already indexed so when you get your reference there you can skip this step. Reference genomes downloaded from NCBI, Ensembl or UCSC need to be indexed using the bowtie-build command.
Indexing a reference genome is a one-time effort: you do not have to repeat it each time you do a mapping. This is why we are not going to include the indexing in the script.
Create a variable called folder containing the path to the folder that contains the E. coli fasta file |
---|
folder=/home/bits/NGS/ChIPSeq/ |
Check out the manual of the bowtie-1.1.2-build command to see the arguments it takes |
---|
Since we have created a soft link for the bowtie-1.1.2-build command, the command should work from any location in the Linux file system. To see to help file just type the command:
bowtie-1.1.2-build In the help file you see that you need to specify the reference genome that you want to index as an input (in our case the E. coli fasta file) and that you have to specify the output file. Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base> reference_in comma-separated list of files with ref sequences ebwt_outfile_base write Ebwt data to files with this dir/basename |
We will give the output files the same name as our input file: Escherichia_coli_K12
Prepare an indexed reference sequence for E. coli using the bowtie-build command, use the folder variable |
---|
So as an input the command expects the name of the input and the output file.
bowtie-1.1.2-build ${folder}Escherichia_coli_K12.fasta ${folder}Escherichia_coli_K12 |
bowtie-build will index the Escherichia_coli_K12.fasta generating a whole set of .ebwt files whose name all start with Escherichia_coli_K12. We will write a bash script to do the rest of the mapping.
Writing a script can be done in any text editor. On the BITS laptops you can use gedit:
- Click the Menu at the bottom left corner of the desktop
- Type gedit in the text search box
- Click the Text Editor button
The first thing you do when you write a script is define all the variables you need.
We need the following variables:
- The folder that contains the reference genome.
- The name of the input fastq file you want to map (if it's in the same folder as the reference as it is in our case). If the fastq file is in another folder you have to specify the full path to the file.
Create the required variables |
---|
folder=/home/bits/NGS/ChIPSeq/ input=SRR576933 |
Make sure that the file containing the indexed reference genome and the fastq files containing the E. coli reads are located in the same folder.
Check the help file for bowtie-1.1.2 |
---|
Go back to the terminal and type
bowtie-1.1.2 |
What is the first argument bowtie expects ? |
---|
As first argument bowtie expects the path to the ebwt files (= the genome index files) so in our case that's Escherichia_coli_K_12
Usage: bowtie [options]* <ebwt> |
What is the second argument bowtie expects ? |
---|
As second argument bowtie expects the information of the input file containing the reads, in our case SRR576933.fastq Bowtie can be used to map single end reads as we have but also to map paired end reads. In the case of paired end reads you have two fastq files, one with the upstream reads and one with the downstream reads. That's why you can specify two input files m1 and m2. In our case it's just one file.Usage: bowtie [options]* <ebwt> {-1 <m1> -2 <m2> <m1> Comma-separated list of files containing upstream mates (or the sequences themselves, if -c is set) paired with mates in <m2> <m2> Comma-separated list of files containing downstream mates (or the sequences themselves if -c is set) paired with mates in <m1> |
What is the final argument bowtie expects ? |
---|
As final argument bowtie expects the output file which is in our case SRR576933.sam
Usage: bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>] <hit> File to write hits to (default: stdout) |
You need to tell bowtie which type of file your input file is.
What is the option for doing this ? |
---|
Via the option: -q indicates the input file is in FASTQ format.
Usage: Input: -q query input files are FASTQ .fq/.fastq (default) |
Fastq is the default, so you don't have to explicitly set this option. If you don't specify it in your command bowtie will automatically assume your input is in fastq format.
You need to tell bowtie the maximum number of mismatches you allow in the alignments of your reads to the reference.
What is the option for doing this ? |
---|
Via the option: -v
Alignment: -v <int> report end-to-end hits w/ <=v mismatches; ignore qualities |
If you set this option's argument to 2, it means that bowtie will allow two mismatches anywhere in the read, when aligning the read to the genome sequence.
Then we want to set an option that allows to define a number of bases that should be trimmed from the 3' ends of the reads before the alignment is done.
What is the option for doing this ? |
---|
Via the option: -3
-3/--trim3 <int> trim <int> bases from 3' (right) end of reads |
We want to set this option to trim the last base from the 3' ends of the reads before the alignment is done.
We also want to specify that we only want reads that map specifically to one location in the genome in our output.
What is the option for doing this ? |
---|
Via the option: -m
-m <int> suppress all alignments if > <int> exist (def: no limit) |
Finally we want to specify that the output should be SAM format.
What is the option for doing this ? |
---|
Via the option: -S
SAM: -S/--sam write hits in SAM format |
Write the error channel to a file called SRR576933.out |
---|
Via the option: -S
2> SRR576933.out |
In the script you use the variables to you have created instead of the actual file name SRR576933
Map the reads to the indexed reference sequence ? |
---|
So the full script becomes: folder=/home/bits/NGS/ChIPSeq/ input=SRR576933 bowtie-1.1.2 ${folder}Escherichia_coli_K12 -q ${folder}${input}.fastq -v 2 -m 1 -3 1 -S 2> ${folder}${input}.out > ${folder}${input}.sam |
We asked the mapper to create a sam file with mapping results. In the same way we could create a bam file. While SAM files can be inspected using Linux commands (head, less, grep, ...), BAM format is compressed and requires a special parser to read the file. Samtools is used to view bam files but it can also be used to analyze sam files.
Look at this very informative wiki on samtools and the official manual of samtools. The manual does not document some of the commands, so it is better to first look in the wiki to find the command you need and then look in the manual to have an overview of the options it uses.
We will use samtools to get a rough idea of the quality of the mapping. Look at the samtools wiki to see which command you need for getting the basic statistics of a sam file.
Command to get the basic stats of the mapping file. |
---|
On the samtools wiki
you can see that you need the samtools flagstat command for this. |
However samtools flagstat expects a bam file as input. So look at the samtools wiki to see which command you need for transforming a sam into a bam file.
Command to convert sam into bam files. |
---|
On the samtools wiki you can see that you need the samtools view command for this. |
For the exact command you need to know if the sam file contains a header. Let's assume that the sam file indeed contains a header (it does, I checked). The symbolic link for samtools is samtools-0.1.18 Notice that we include the version number of bowtie and samtools in the symbolic link because we have mutiple versions of bowtie and samtools installed on the laptops.
Add the command for transforming the sam into a bam file to your script |
---|
folder=/home/bits/NGS/ChIPSeq/ input=SRR576933 bowtie-1.1.2 ${folder}Escherichia_coli_K12 -q ${folder}${input}.fastq -v 2 -m 1 -3 1 -S 2> ${folder}${input}.out > ${folder}${input}.sam samtools-0.1.18 view -bS ${folder}${input}.sam > ${folder}${input}.bam |
Add the command for analyzing the bam file to your script |
---|
folder=/home/bits/NGS/ChIPSeq/ input=SRR576933 bowtie-1.1.2 ${folder}Escherichia_coli_K12 -q ${folder}${input}.fastq -v 2 -m 1 -3 1 -S 2> ${folder}${input}.out > ${folder}${input}.sam samtools-0.1.18 view -bS ${folder}${input}.sam > ${folder}${input}.bam samtools-0.1.18 flagstat ${folder}${input}.bam |
Bash scripts all have one characteristic: the first line of a bash script is always the following:
#!/bin/bash
This tells the system which program should be used to interpret the script (in this case: /bin/bash)
Add this line to your script |
---|
So the full script becomes:
#!/bin/bash folder=/home/bits/NGS/ChIPSeq/ input=SRR576933 bowtie-1.1.2 ${folder}Escherichia_coli_K12 -q ${folder}${input}.fastq -v 2 -m 1 -3 1 -S 2> ${folder}${input}.out > ${folder}${input}.sam samtools-0.1.18 view -bS ${folder}${input}.sam > ${folder}${input}.bam samtools-0.1.18 flagstat ${folder}${input}.bam |
Save the script as "my_mapping" in the /home/bits/NGS folder.
Check permissions of the script and change them if needed. |
---|
Go to the folder where you have saved the script: /home/bits/NGS and type
ll The script is not executable:
Make it executable by typing: chmod 755 my_mapping ll |
To run the script make sure you are in folder containing the script (/home/bits/NGS) and type:
./my_mapping
The samtools flagstat command displays an overview of the alignment results on your screen. The results are not very informative because the data set comes from a single-end sequencing experiment. You just see that 62% of the reads were mapped. This may seem low but remember that we haven't done any cleaning on the file. According to FASTQC the file contains about 30% of adapter sequences that will not map.
Repeat the analysis for the control sample SRR576938.fastq These two fastq files come from a ChIP-Seq experiment, the first contains the reads of the ChIP sample, the second of the control sample, which consists of fragmented genomic DNA. You need both to identify regions in the genome that are represented more in the ChIP reda than in the control (these are the regions that bind to the transcription factor).
Repeat the analysis for sample SRR576938.fastq ? |
---|
Repeating the mapping is easy now the only thing you need to do is changing the value of the input variable in the script:
|
How many reads of the control sample were mapped ? |
---|
In the flagstat results, you see that 95% of the reads was mapped. This is of course ok but you expected a high percentage here since the control sample is nothing more than the reference genome cut up into small pieces. |
At this point, you have two sam and two bam files, one for the treated sample, one for the control sample.
For paired-end data flagstat results are much more informative, see an example below:
This overview deserves some explanation:
- nan means Not A Number (e.g: divided by 0 )
- paired in sequencing means reads that belong to a pair regardless of the fact that they are really mapped as a pair
- read1 means forward reads
- read2 means reverse reads
- properly paired means that both mates of a read pair map to the same chromosome, oriented towards each other, and with a sensible insert size
- with itself and mate mapped means that both reads of a pair map to the genome but they are not necessarily properly paired, they just map somewhere on the genome
- singletons means that one of the reads of a pair is unmapped while its mate is mapped
- with mate mapped to a different chr means reads with a mate mapped on a different chromosome
- with mate mapped to a different chr (mapQ >= 5) means reads with a mate mapped on a different chromosome having a mapping quality greater than 5
Compare the number of forward and reverse reads in the paired-end experiment. |
---|
the counts of forward and reverse reads are to be found on the lines ending with read1 and read2 respectively. As you see the number of reverse reads exceeds the number of forward reads by 439. |
How many reads were mapped as a pair in the paired-end experiment? |
---|
12.911.388 reads were properly mapped as a pair, that's 85,68% of the total number of reads |
You can find similar info in the SRR576933.out file in the ChIPSeq folder (using the less command), which also contains some statistics about the mapping.
How many reads were mapped according to this file ? |
---|
You see that 62% of the reads was mapped, which is good considering 30% of the reads contained adapter sequences. Type q to leave the less editor. This result is in agreement with the result of the samtools flagstat command. |
Visualize mapping in IGV
IGV is installed on the bits laptops and can be run using the igv command.
igv
We open the bam file that was generated by the Picard modules in IGV. The bam file contains Arabidopsis reads. This means we have to visualize them on the Arabidopsis genome. Change the genome in IGV from Human hg19 to A. thaliana (TAIR10).
This should display the Arabidopsis genome in the top and the bottom view.</p>
Now it's time to load the mapped reads via File in the top menu and Load from File.
Select the .bam file to open. You don't need to load the .bai file, it's suffcient that it is present in the same folder as the .bam file.
This loads the data into the center view. At this point, you can't see the reads, you have to zoom in to view them.
According to the supplemental material accompanying the paper describing this data set, AT1G02930 is highly expressed in all samples and differentially expreesed during the defense response in A. thaliana. So we will zoom in on this gene. You can do this by typing the accession number in the top toolbar and clicking Go:
The reads for this gene are now visualized in the center view. You can zoom in even more using the zoom bar in the top toolbar:
Zoom in until you see the nucleotides of the reference sequence.
The reads are represented by grey arrows, the arrow indicating the orietation of the mapping. Hovering your mouse over a read gives additional info on the mapping. The colored nucleotides indicate mismatches between the read and the reference. Alignments that are displayed with light gray borders and white fill, have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapper as some commonly used mappers use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.
By default IGV calculates and displays the coverage track (red) for an alignment file. When IGV is zoomed to the alignment read visibility threshold (by default 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). You can view count details by hovering the mouse over a coverage bar: