An example about parsing through a list of files

From BITS wiki
Jump to: navigation, search
Go back to Perl introductionary training#Exercises

Today it is common that scientific researchers have to handle huge amounts of data and precisely Perl provides an easy tool to automate the management of data sets. We will start the exercise session of today with a simple example : suppose you have a list with human proteins and you want to know which is the most similar mouse protein. A rudimentary (but occasionally really used) method is to do a BLAST search against a databank of mouse sequences and take each time the first "hit". The directory blastresults contains a set of BLAST outputs. Open a "DOS box", go to the directory Perl and do :

dir blastresults

You will see that the directory blastresults contains a series of BLAST output files as well as a file called NOTES. We need to parse one by one through all the BLAST outputs. Perl has an opendir command that creates a "directoryhandle" and a readdir command that each time it is executed on a directoryhandle reads the name of one file. Write a program :

#!/usr/bin/perl

opendir DIR, 'blastresults';
while ($filename = readdir DIR) {
  print "$filename\n";
}
closedir DIR;

This program will write on the screen the names of all the files in the directory, just as the DOS command dir does. We have a first problem : we must parse only through the BLAST output files xxx.blastp but the list of files also contains NOTES, as well as '.' (= the directory itself) and '..' (= the directory one level "higher up"). This problem can easily be solved using a "regular expression". Modify the program so that it becomes :

#!/usr/bin/perl

opendir DIR, 'blastresults';
while ($filename = readdir DIR) {
  if ($filename =~ /\.blastp$/) {
    print "$filename\n";
  }
}
closedir DIR;

If you remember everything you learned during exercise 2 of the first session you will note the use of \. to indicate that we want to match a dot instead of any character and the $ to indicate that the p must be the last character of the name. Now we must parse through the files ; we can do this by re-using the same filehandle over and over again. Let's convince ourselves that it works by simply printing the first line of every BLAST output file. Modify the program so that it becomes :

#!/usr/bin/perl

opendir DIR, 'blastresults';
while ($filename = readdir DIR) {
  if ($filename =~ /\.blastp$/) {
    open FILE, "blastresults/$filename";
    $line = <FILE>;
    print $line;
    close FILE;
  }
}
closedir DIR;

You will note that since we are running the program from the directory Perl and the files are located inside blastresults it is necessary to put before each filename the path that the operating system of your computer must follow in order to find the file while starting from the working directory. Those who are familiar with DOS shall note that inside the Perl program we use '/' as "separator" and not '\' ; this is of course because Perl was developed in a UNIX environment.

What we want to obtain is a list of lines of type :

ACHB2_HUMAN	achb2_mouse
CAD22_HUMAN	cad22_mouse
...

Try to modify the program so as to obtain this. The answer is not on this exercise sheet, we will let you work a while on it and then display a possible solution on the screen. Some hints : start of course by taking a good look at the structure of the BLAST output ; you can use the Notepad to do that ; note :

  • the identifier of the human sequence is on a line that starts with Query=
  • there is a line Sequences producing significant alignments: ; if you read 2 times a line beyond you get a line that contains as first word the identifier of the mouse sequence.

You must also make that the output is written in a file rather than on the screen.

Go back to Perl introductionary training#Exercises