Text analysis on the command line
Go back to parent page Introduction to Linux for bioinformatics
Contents
Extract some lines
Download the bed file from http://data.bits.vib.be/pub/trainingen/Linux/TAIR9_mRNA.bed via command line |
---|
wget https://dl.dropbox.com/u/18352887/TAIR9_mRNA.bed |
Look at the first 10 lines of this file. |
---|
$ head TAIR9_mRNA.bed chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3 541,322,429, 0,833,1242, chr5 2625558 2628110 AT5G08160.1 0 - 2625902 2627942 0 6 385,143,144,186,125,573, 2167,1523,1269,928,659,0, chr5 2625558 2628110 AT5G08160.2 0 - 2625902 2627942 0 7 258,19,143,144,186,125,573, 2294,2167,1523,1269,928,659,0, chr4 12006985 12009520 AT4G22890.5 0 + 12007156 12009175 0 10 370,107,97,101,57,77,163,98,80,263, 0,802,1007,1196,1392,1533,1703,1945,2120,2272, chr4 12007040 12009206 AT4G22890.2 0 + 12007156 12009175 0 9 315,113,97,101,57,77,163,98,101, 0,741,952,1141,1337,1478,1648,1890,2065, chr4 12006985 12009518 AT4G22890.3 0 + 12007156 12009175 0 10 370,113,97,101,57,77,163,98,80,257, 0,796,1007,1196,1392,1533,1703,1945,2120,2276, chr4 12006985 12009520 AT4G22890.4 0 + 12007156 12009175 0 10 370,104,97,101,57,77,163,98,80,263, 0,805,1007,1196,1392,1533,1703,1945,2120,2272, chr4 12006985 12009520 AT4G22890.1 0 + 12007156 12009175 0 10 370,113,97,101,57,77,163,98,80,263, 0,796,1007,1196,1392,1533,1703,1945,2120,2272, chr2 14578539 14581727 AT2G34630.2 0 + 14578688 14581632 0 11 293,93,81,72,132,87,72,86,133,189,275, 0,797,1120,1320,1488,1711,1898,2165,2435,2649,2913, chr2 14578629 14581727 AT2G34630.1 0 + 14579725 14581632 0 11 203,96,81,72,132,87,72,86,133,189,275, 0,704,1030,1230,1398,1621,1808,2075,2345,2559,2823, |
This is a typical bioinformatics text file, with every row divided in field by tabs.
Extract all lines that start with chr1 from the TAIR9_mRNA.bed and put them in a new text file “chr1_TAIR9_mRNA.bed”. |
---|
$ grep "^chr1" TAIR9_mRNA.bed > chr1_TAIR9_mRNA.bed |
Checking the data
Download human chromosome 21 from https://data.bits.vib.be/pub/trainingen/Linux/Homo_sapiens.dna.chromosome21.zip |
---|
wget ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.chromosome.21.fa.gz |
Unzip the file |
---|
gunzip Homo_sapiens.GRCh37.73.dna.chromosome.21.fa.gz |
Entries in a fasta file start with >
How many entries are in that fasta file? Remember you can combine commands with a |. |
---|
grep "^>" Homo_sapiens.GRCh37.73.dna.chromosome.21.fa | wc -l |
How many ?
Use the TAIR9_mRNA.bed file used in the first exercise. Remember it looks like this
chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3 chr5 2625558 2628110 AT5G08160.1 0 - 2625902 2627942 0 6 chr5 2625558 2628110 AT5G08160.2 0 - 2625902 2627942 0 7 chr4 12006985 12009520 AT4G22890.5 0 + 12007156 12009175 0 10 chr4 12007040 12009206 AT4G22890.2 0 + 12007156 12009175 0 9
If you want to find entries that lie on the + strand of a certain chromosome, you need to find lines that start with the chromosome number and that contain a + sign. The number of characters between the chromosome number and the + sign is variable.
How many genes are lying on the + strand of the first chromosome ? |
---|
Since you need to use the + sign to represent a set of characters of variable length you need to use egrep for this:
grep "^chr1.+\+" TAIR9_mRNA.bed | wc -l |
More complex extraction
Get the last exon size for all mRNA records in Arabidopsis. Use TAIR9_mRNA.bed for this: this file contains the exon sizes. See the BED page to check that the field we need is field 11. This contains a comma separated list of the sizes of all the exons of a mRNA
Get the exon sizes for all mRNA records in Arabidopsis. Write them to a file called exons.txt |
---|
awk '{ print $11 }' TAIR9_mRNA.bed > exons.txt |
Take a look at the first 10 lines of exons.txt |
---|
head exons.txt |
If we try to print the last field with awk, using ',' as a delimiter, things go wrong:
awk -F',' '{ print $NF }' > lastexons.txt
The reason is that the last field is empty, because the lines end with a ','. We need to remove the last ',' and can use sed for this.
Remove the last comma from the lines and save in a file called exonsclean.txt |
---|
You want to substitute the comma at the end of the line by nothing:
sed 's/,$//' exons.txt > exonsclean.txt head exonsclean.txt |
Fetch the last field from exonsclean.txt and save in a file called lastexons.txt |
---|
awk -F',' '{ print $NF }' exonsclean.txt > lastexons.txt head lastexons.txt |
Sort exonsizes from largest to smallest into a file called lastexonssort.txt |
---|
sort -nr lastexons.txt > lastexonssort.txt head lastexonssort.txt |
You can use uniq to summarize the results
uniq -c lastexonssort.txt | head 2 6885 1 5616 1 5601 1 5361 1 5239 1 4688 2 4470 1 4446 1 4443 1 4275
Analyzing a short read alignment
SAM ('sequence alignment map') file format is the format which summarizes the alignment of reads to a reference genome. Is is one of the key files in NGS analysis, and you can learn a lot from it. See .sam for a description of this format.
Download the sam file from http://data.bits.vib.be/pub/trainingen/Linux/sample.sam |
---|
wget http://data.bits.vib.be/pub/trainingen/Linux/sample.sam |
How many lines has the SAM file? |
---|
wc -l sample.sam |
100015 lines
How many lines start with '@', which is the comment symbol in the SAM format. |
---|
grep '^@' sample.sam | wc -l |
15
You can use grep to skip the lines starting with '@', since they are comment lines.
grep -v '^@' sample.sam | head
Write the FLAG field (second field) to a file called flags.txt |
---|
Pipe the grep results to awk to print the second field.
grep -v '@' sample.sam | awk '{ print $2 }' > flags.txt head flags.txt |
Sort and summarize (using uniq) flags.txt |
---|
Pipe the grep results to awk to print the second field.
sort -nr flags.txt | uniq -c |
Sort the results on number of times observed (the first field) |
---|
We build on the previous command, and just pipe the output to sort -nr. We do not have to use the option -k, since sort always takes the first field.
sort -nr flags.txt | uniq -c | sort -nr |
Advanced
We use the TAIR9_mRNA.bed to answer this.
First we check how many different genes are in the file. A gene has the code AT<number>G<number>. Splice variants have to same AT number but different version number (the numbers after the . are different. We are not interested in splice variants so want to remove the .1, .2... before counting. You can do this by using the . as a field delimiter
Remove everything after the . and save in a file called TAIRpart.txt |
---|
awk -F'.' '{ print $1 }' TAIR9_mRNA.bed > TAIRpart.txt head TAIRpart.txt |
Now you need to summarize the fourth column of this file and count the lines of the result
How many different genes are in the file. |
---|
cut -f4 TAIRpart.txt | sort | uniq | wc -l |
27379
When you look at TAIR9_mRNA.bed you see that the the fifth column contains 0.
Check if there is any entry that contains another number in that column ? (summarize will give you the answer) |
---|
cut -f5 TAIR9_mRNA.bed | sort -nr | uniq -c |
No
Another example: Show all Arabidopsis mRNA with more than 50 exons
awk '{ if ($10>50) print $4 }' TAIR9_mRNA.bed
Print the number of exons (field number 10) of mRNAs from the first chromosome. |
---|
grep '^chr1' TAIR9_mRNA.bed | awk '{ print $10 }' |
Obtain AT numbers (field 4) and exon info (field 11) |
---|
awk '{ print $4,",",$11 }' TAIR9_mRNA.bed |
Go back to parent page Introduction to Linux for bioinformatics