Introduction to Linux for bioinformatics
Linux is a very popular operating system in bioinformatics. In this training you will learn why that is and how it can help you with your bioinformatics analysis. After this training you will be able to:
- install software on Linux
- use the command line to run tools
- use the command line to handle files
- write small scripts to automate your analysis
Training material
Additional information
Exercises during the training
Exercises: Part 1
During the training, an Ubuntu Linux installation is available in a Google Cloud environment. To access it, we use Google Chrome and the 'VNC Viewer for Google Chrome' application.
When you launch the application, you have to enter an IP address. This address will be given during the training.
Installing Linux
- Create a bootable live USB drive for testing distributions
- Create a virtual machine running a Linux distribution using VirtualBox
Installing software
Install the tools from the presentation slides. Note that when you install something, everyone in the training has access to that tool!
Here are some exercises to try on your personal Linux installation:
- Install Geany, a powerful, light-weight programming editor
- Adding a new software repository for bioinformatics
- Example of downloading and installing a .deb file
- Download and install the Kent tools, a collection of binaries and scripts
- (optional) Adding repositories containing bioinformatics tools in Red Hat-derived distributions
- (optional) Useful bioinformatics software packages in Linux
- Additional Bioinformatics repositories
Command line
File system
Bonus
Exercises: Part 2
Text mining, scripting and 'for' loops
- Simple bash script
- Download a simple Perl script
- Text analysis on the command line
- Bash aliases to enhance your productivity
- Writing for loops
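As a rough illustration of the kind of 'for' loop these exercises build towards, the sketch below counts the reads in every FASTQ file in a directory. The directory `demo_fastq`, the file names, and the output file `read_counts.txt` are made up for this example; the loop itself only relies on standard bash and `wc`.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: two tiny FASTQ files in a scratch directory.
mkdir -p demo_fastq
printf '@r1\nACGT\n+\nIIII\n' > demo_fastq/sample1.fq
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > demo_fastq/sample2.fq

# Loop over every .fq file. A FASTQ record spans 4 lines,
# so the number of reads is the line count divided by 4.
for fq in demo_fastq/*.fq; do
    lines=$(wc -l < "$fq")
    echo "$(basename "$fq"): $((lines / 4)) reads"
done > read_counts.txt

cat read_counts.txt
```

The same pattern (loop over a glob, run one command per file, collect the output) can be reused for most per-sample steps in an analysis.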
NGS intro
Bonus
EXTRA
Bioinformatics one-liners
Sneak preview of the read duplication rate
gunzip -dc fastq.gz | head -n 1000000 | awk '{ if(NR%4==2) { print $1 } }' | sort | uniq -c | sort -g > sorted_duplicated
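To see what this one-liner produces, it can be tried on a toy file first. The sketch below assumes a made-up gzipped FASTQ file `demo.fastq.gz` with three reads, two of which share the same sequence; the file names are hypothetical and the `head` step is left out because the file is tiny.

```shell
# Hypothetical gzipped FASTQ with three reads; r1 and r2 have the same sequence.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' | gzip -c > demo.fastq.gz

# Line 2 of every 4-line record is the sequence. Count identical sequences
# and sort by count, so the most duplicated sequences end up at the bottom.
gunzip -c demo.fastq.gz | awk 'NR % 4 == 2 { print $1 }' | sort | uniq -c | sort -g > demo_sorted_duplicated

cat demo_sorted_duplicated
```

Each output line holds a count followed by a sequence, so the last lines of the file show the sequences that occur most often.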
Convert fastq to fasta
paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa
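A small worked example may help to see what each step of this pipeline does. The input below is a made-up two-read FASTQ file; `in.fq` and `out.fa` are simply the names used in the one-liner above.

```shell
# Hypothetical two-read FASTQ file for illustration.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n' > in.fq

# paste joins each group of 4 lines into one tab-separated row,
# cut keeps the header and the sequence (columns 1 and 2),
# sed replaces the leading '@' with '>',
# and tr puts header and sequence back on separate lines.
paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

cat out.fa
```

Note that this approach assumes every record spans exactly four lines and that the header line contains no tab characters.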
Count all the variants called in all the VCF files
cat *.vcf | grep -v '^#' | wc -l
Count the variants that occur in all three VCF files
cat *.raw.vcf | grep -v '^#' | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -c | grep ' 3 ' | wc -l
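The counting trick here is that a variant (chromosome, position and ALT allele) present in all three files shows up three times after concatenation. A minimal sketch on made-up data, assuming three hypothetical files `s1.raw.vcf`, `s2.raw.vcf` and `s3.raw.vcf` that each list a variant at most once:

```shell
# Three hypothetical, minimal VCF files (names and variants are made up).
header='##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
printf "${header}chr1\t100\t.\tA\tG\t50\tPASS\t.\nchr1\t200\t.\tC\tT\t50\tPASS\t.\n" > s1.raw.vcf
printf "${header}chr1\t100\t.\tA\tG\t50\tPASS\t.\nchr2\t300\t.\tG\tA\t50\tPASS\t.\n" > s2.raw.vcf
printf "${header}chr1\t100\t.\tA\tG\t50\tPASS\t.\nchr1\t200\t.\tC\tT\t50\tPASS\t.\n" > s3.raw.vcf

# Drop header lines, key each variant on chromosome, position and ALT allele,
# count how often each key occurs, and keep the keys seen exactly 3 times.
cat s1.raw.vcf s2.raw.vcf s3.raw.vcf | grep -v '^#' \
    | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -c \
    | grep ' 3 ' | wc -l > shared_count.txt

cat shared_count.txt
```

In this toy data only the variant at chr1:100 is present in all three files. The `grep ' 3 '` step relies on the padded count column that `uniq -c` prints, so it would miscount if a single file listed the same variant more than once.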