Text analysis on the command line

From BITS wiki
Jump to: navigation, search
Go back to parent page Introduction to Linux for bioinformatics

Extract some lines

This is a typical bioinformatics text file, with every row divided in field by tabs.


Checking the data

Entries in a fasta file start with >


How many ?

Use the TAIR9_mRNA.bed file used in the first exercise. Remember it looks like this

chr1	2025600	2027271	AT1G06620.1	0	+	2025617	2027094	0	3	
chr5	2625558	2628110	AT5G08160.1	0	-	2625902	2627942	0	6	
chr5	2625558	2628110	AT5G08160.2	0	-	2625902	2627942	0	7	
chr4	12006985	12009520	AT4G22890.5	0	+	12007156	12009175	0	10	
chr4	12007040	12009206	AT4G22890.2	0	+	12007156	12009175	0	9

If you want to find entries that lie on the + strand of a certain chromosome, you need to find lines that start with the chromosome number and that contain a + sign. The number of characters between the chromosome number and the + sign is variable.


More complex extraction

Get the last exon size for all mRNA records in Arabidopsis. Use TAIR9_mRNA.bed for this: this file contains the exon sizes. See the BED page to check that the field we need is field 11. This contains a comma separated list of the sizes of all the exons of a mRNA

If we try to print the last field with awk, using ',' as a delimiter, things go wrong:

awk -F',' '{ print $NF }' > lastexons.txt

The reason is that the last field is empty, because the lines end with a ','. We need to remove the last ',' and can use sed for this.

You can use uniq to summarize the results

uniq -c lastexonssort.txt | head
      2 6885
      1 5616
      1 5601
      1 5361
      1 5239
      1 4688
      2 4470
      1 4446
      1 4443
      1 4275



Analyzing a short read alignment

SAM ('sequence alignment map') file format is the format which summarizes the alignment of reads to a reference genome. Is is one of the key files in NGS analysis, and you can learn a lot from it. See .sam for a description of this format.

100015 lines

15

You can use grep to skip the lines starting with '@', since they are comment lines.

grep -v '^@' sample.sam | head


Advanced

We use the TAIR9_mRNA.bed to answer this.

First we check how many different genes are in the file. A gene has the code AT<number>G<number>. Splice variants have to same AT number but different version number (the numbers after the . are different. We are not interested in splice variants so want to remove the .1, .2... before counting. You can do this by using the . as a field delimiter

Now you need to summarize the fourth column of this file and count the lines of the result

27379

When you look at TAIR9_mRNA.bed you see that the the fifth column contains 0.

No

Another example: Show all Arabidopsis mRNA with more than 50 exons

awk '{ if ($10>50) print $4 }' TAIR9_mRNA.bed

Go back to parent page Introduction to Linux for bioinformatics