Go back to parent page Introduction to Linux for bioinformatics

Extract some lines

Download the bed file from http://data.bits.vib.be/pub/trainingen/Linux/TAIR9_mRNA.bed via command line
wget https://dl.dropbox.com/u/18352887/TAIR9_mRNA.bed

Look at the first 10 lines of this file.

 $ head TAIR9_mRNA.bed 
chr1	2025600	2027271	AT1G06620.1	0	+	2025617	2027094	0	3	541,322,429,	0,833,1242,
chr5	2625558	2628110	AT5G08160.1	0	-	2625902	2627942	0	6	385,143,144,186,125,573,	2167,1523,1269,928,659,0,
chr5	2625558	2628110	AT5G08160.2	0	-	2625902	2627942	0	7	258,19,143,144,186,125,573,	2294,2167,1523,1269,928,659,0,
chr4	12006985	12009520	AT4G22890.5	0	+	12007156	12009175	0	10	370,107,97,101,57,77,163,98,80,263,	0,802,1007,1196,1392,1533,1703,1945,2120,2272,
chr4	12007040	12009206	AT4G22890.2	0	+	12007156	12009175	0	9	315,113,97,101,57,77,163,98,101,	0,741,952,1141,1337,1478,1648,1890,2065,
chr4	12006985	12009518	AT4G22890.3	0	+	12007156	12009175	0	10	370,113,97,101,57,77,163,98,80,257,	0,796,1007,1196,1392,1533,1703,1945,2120,2276,
chr4	12006985	12009520	AT4G22890.4	0	+	12007156	12009175	0	10	370,104,97,101,57,77,163,98,80,263,	0,805,1007,1196,1392,1533,1703,1945,2120,2272,
chr4	12006985	12009520	AT4G22890.1	0	+	12007156	12009175	0	10	370,113,97,101,57,77,163,98,80,263,	0,796,1007,1196,1392,1533,1703,1945,2120,2272,
chr2	14578539	14581727	AT2G34630.2	0	+	14578688	14581632	0	11	293,93,81,72,132,87,72,86,133,189,275,	0,797,1120,1320,1488,1711,1898,2165,2435,2649,2913,
chr2	14578629	14581727	AT2G34630.1	0	+	14579725	14581632	0	11	203,96,81,72,132,87,72,86,133,189,275,	0,704,1030,1230,1398,1621,1808,2075,2345,2559,2823,

This is a typical bioinformatics text file, with every row divided in field by tabs.

Extract all lines that start with chr1 from the TAIR9_mRNA.bed and put them in a new text file “chr1_TAIR9_mRNA.bed”.
$ grep "^chr1" TAIR9_mRNA.bed > chr1_TAIR9_mRNA.bed

Checking the data

Download human chromosome 21 from https://data.bits.vib.be/pub/trainingen/Linux/Homo_sapiens.dna.chromosome21.zip
wget ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.chromosome.21.fa.gz

Unzip the file
gunzip Homo_sapiens.GRCh37.73.dna.chromosome.21.fa.gz

Entries in a fasta file start with >

How many entries are in that fasta file? Remember you can combine commands with a \|.
grep "^>" Homo_sapiens.GRCh37.73.dna.chromosome.21.fa \| wc -l

How many ?

Use the TAIR9_mRNA.bed file used in the first exercise. Remember it looks like this

chr1	2025600	2027271	AT1G06620.1	0	+	2025617	2027094	0	3	
chr5	2625558	2628110	AT5G08160.1	0	-	2625902	2627942	0	6	
chr5	2625558	2628110	AT5G08160.2	0	-	2625902	2627942	0	7	
chr4	12006985	12009520	AT4G22890.5	0	+	12007156	12009175	0	10	
chr4	12007040	12009206	AT4G22890.2	0	+	12007156	12009175	0	9

If you want to find entries that lie on the + strand of a certain chromosome, you need to find lines that start with the chromosome number and that contain a + sign. The number of characters between the chromosome number and the + sign is variable.

How many genes are lying on the + strand of the first chromosome ?
Since you need to use the + sign to represent a set of characters of variable length you need to use egrep for this: grep "^chr1.+\+" TAIR9_mRNA.bed \| wc -l

More complex extraction

Get the last exon size for all mRNA records in Arabidopsis. Use TAIR9_mRNA.bed for this: this file contains the exon sizes. See the BED page to check that the field we need is field 11. This contains a comma separated list of the sizes of all the exons of a mRNA

Get the exon sizes for all mRNA records in Arabidopsis. Write them to a file called exons.txt
awk '{ print $11 }' TAIR9_mRNA.bed > exons.txt

Take a look at the first 10 lines of exons.txt
head exons.txt

If we try to print the last field with awk, using ',' as a delimiter, things go wrong:

awk -F',' '{ print $NF }' > lastexons.txt

The reason is that the last field is empty, because the lines end with a ','. We need to remove the last ',' and can use sed for this.

Remove the last comma from the lines and save in a file called exonsclean.txt
You want to substitute the comma at the end of the line by nothing: sed 's/,$//' exons.txt > exonsclean.txt head exonsclean.txt

Fetch the last field from exonsclean.txt and save in a file called lastexons.txt
awk -F',' '{ print $NF }' exonsclean.txt > lastexons.txt head lastexons.txt

Sort exonsizes from largest to smallest into a file called lastexonssort.txt
sort -nr lastexons.txt > lastexonssort.txt head lastexonssort.txt

You can use uniq to summarize the results

uniq -c lastexonssort.txt | head
      2 6885
      1 5616
      1 5601
      1 5361
      1 5239
      1 4688
      2 4470
      1 4446
      1 4443
      1 4275

Analyzing a short read alignment

SAM ('sequence alignment map') file format is the format which summarizes the alignment of reads to a reference genome. Is is one of the key files in NGS analysis, and you can learn a lot from it. See .sam for a description of this format.

Download the sam file from http://data.bits.vib.be/pub/trainingen/Linux/sample.sam
wget http://data.bits.vib.be/pub/trainingen/Linux/sample.sam

How many lines has the SAM file?
wc -l sample.sam

100015 lines

How many lines start with '@', which is the comment symbol in the SAM format.
grep '^@' sample.sam \| wc -l

15

You can use grep to skip the lines starting with '@', since they are comment lines.

grep -v '^@' sample.sam | head

Write the FLAG field (second field) to a file called flags.txt
Pipe the grep results to awk to print the second field. grep -v '@' sample.sam \| awk '{ print $2 }' > flags.txt head flags.txt

Sort and summarize (using uniq) flags.txt
Pipe the grep results to awk to print the second field. sort -nr flags.txt \| uniq -c

Sort the results on number of times observed (the first field)
We build on the previous command, and just pipe the output to sort -nr. We do not have to use the option -k, since sort always takes the first field. sort -nr flags.txt \| uniq -c \| sort -nr

Advanced

We use the TAIR9_mRNA.bed to answer this.

First we check how many different genes are in the file. A gene has the code AT<number>G<number>. Splice variants have to same AT number but different version number (the numbers after the . are different. We are not interested in splice variants so want to remove the .1, .2... before counting. You can do this by using the . as a field delimiter

Remove everything after the . and save in a file called TAIRpart.txt
awk -F'.' '{ print $1 }' TAIR9_mRNA.bed > TAIRpart.txt head TAIRpart.txt

Now you need to summarize the fourth column of this file and count the lines of the result

How many different genes are in the file.
cut -f4 TAIRpart.txt \| sort \| uniq \| wc -l

27379

When you look at TAIR9_mRNA.bed you see that the the fifth column contains 0.

Check if there is any entry that contains another number in that column ? (summarize will give you the answer)
cut -f5 TAIR9_mRNA.bed \| sort -nr \| uniq -c

No

Another example: Show all Arabidopsis mRNA with more than 50 exons

awk '{ if ($10>50) print $4 }' TAIR9_mRNA.bed

Print the number of exons (field number 10) of mRNAs from the first chromosome.
grep '^chr1' TAIR9_mRNA.bed \| awk '{ print $10 }'

Obtain AT numbers (field 4) and exon info (field 11)
awk '{ print $4,",",$11 }' TAIR9_mRNA.bed

Go back to parent page Introduction to Linux for bioinformatics

Text analysis on the command line

Contents

Extract some lines

Checking the data

How many ?

More complex extraction

Analyzing a short read alignment

Advanced

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox