Parsing files

From BITS wiki

Often you will have to parse text from a file. As a first example, take the following simple program, secondline.pl, which just displays the second line of a text file:

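#!/usr/bin/perl
open FILE, $ARGV[0];   # open the file named on the command line
<FILE>;                # read the first line and discard it
$line = <FILE>;        # read the second line and keep it
print $line;
close FILE;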

Try it out on a file you write yourself or on text4perl, passing the file name as the argument on the command line (for example, perl secondline.pl text4perl). You can check the content of text4perl by opening it with WordPad or by typing type text4perl in the DOS box.

Some explanation: the first statement of the program (the line after #!/usr/bin/perl) is open XXX, xxx, which opens a file for reading. xxx is a text string holding the name of an existing file; it can also be an expression that returns such a string, as $ARGV[0] does here (the first argument given on the command line). XXX is a so-called "filehandle": once the file has been opened, the program refers to the file only by its filehandle and not by its name. Perl programmers customarily write filehandle names in uppercase, so that they are easy to recognise as filehandles.

The second statement uses the "angle operator" <XXX>, with XXX a filehandle. It reads one line of the file and returns it (including the end-of-line character). Perl keeps track of the current position in the file, so each time the angle operator is applied to a filehandle you get the next line. Note that the second statement reads the first line of the file but does nothing with it. The third statement reads the second line of the file and this time stores its content in the variable $line.

It is good practice to "close" a file when you do not need it anymore, although all open files are closed anyway when the program terminates.
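The way the angle operator advances through a file can be tried out with the small sketch below. To be self-contained it first creates a three-line scratch file (writing to files is explained further down this page); the file name demo_lines.txt and the variable names are made up for this illustration.

```perl
#!/usr/bin/perl
# Create a small scratch file so that the example runs on its own.
# "or die" stops the program if the file cannot be opened (explained below).
open OUT, ">demo_lines.txt" or die "Cannot create demo_lines.txt\n";
print OUT "first\nsecond\nthird\n";
close OUT;

open FILE, "demo_lines.txt" or die "Cannot open demo_lines.txt\n";
$first  = <FILE>;   # reads "first\n"
$second = <FILE>;   # the next call returns the next line: "second\n"
$third  = <FILE>;   # reads "third\n"
$fourth = <FILE>;   # no lines left: the result is undefined
close FILE;

print "1: $first";
print "2: $second";
print "3: $third";
print "end of file reached\n" unless defined $fourth;
```

Each variable still contains its end-of-line character, which is why the print statements do not add "\n" themselves.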

You might wonder what happens if the file you named does not exist (e.g. because you mistyped its name in the command) or exists but cannot be opened because of file permissions. In that case open fails, but the program carries on regardless: you will see no output, because nothing can be read from the filehandle FILE. Try it! You can make the program fool-proof by replacing the open statement with:

open FILE, $ARGV[0] or die "Cannot open $ARGV[0]\n";

This is again a typical habit of Perl programmers. How does it work? The Perl operator or behaves as you would expect from propositional logic: xxx or yyy is true if xxx or yyy (or both) is true. The trick, however, is that if xxx is true, the Perl interpreter does not waste effort evaluating yyy, since the final result is true anyway. So if yyy contains a command, that command is not executed; yyy could even contain a statement that makes the program crash, and no harm is done as long as xxx is true. Here xxx is the open statement, which returns true when the file was opened successfully and false otherwise, and yyy is a call to the die function, which prints a message (on the standard error) and terminates the program.
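The short-circuit behaviour of or can be observed directly with a sketch like the following. The variable names and the file name no_such_file_for_demo.txt are invented for the illustration, and the example assumes that no file of that name exists in the current directory.

```perl
#!/usr/bin/perl
# The right-hand side of "or" runs only when the left-hand side is false.
$count = 0;

1 or $count = $count + 1;   # left side is true: the assignment is skipped
print "after true  or ...: count = $count\n";

0 or $count = $count + 1;   # left side is false: the assignment is executed
print "after false or ...: count = $count\n";

# The same mechanism with open: the right-hand side runs only if open fails.
# (Assumption: no_such_file_for_demo.txt does not exist.)
$message = "open succeeded";
open MISSING, "no_such_file_for_demo.txt" or $message = "open failed";
print "$message\n";
```

With die on the right-hand side instead of an assignment, the program would stop at that point, which is exactly what the fool-proof version of secondline.pl does.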

You can also write to files, using the following statements:

open XXX, ">xxx"
    Open a file named xxx for writing. Note that if a file xxx already exists, its content will be lost!
open XXX, ">>xxx"
    Open file xxx in "append" mode, that is, start adding text at its end.
print XXX xxx
    Write xxx to the file that has filehandle XXX, where xxx is a string, a variable or a command that returns some text. Note that there is no comma ',' between XXX and xxx. The reason is that when you write to the screen there is no XXX, and since xxx can consist of a series of comma-separated items, a comma after XXX would leave the Perl interpreter unable to distinguish XXX from the first part of xxx.
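As a minimal sketch of these three statements working together, the program below writes one line in write mode, adds a second line in append mode and then reads both back. The file name demo_out.txt and the variable names are made up for this example.

```perl
#!/usr/bin/perl
# ">" creates the file, or empties it if it already exists ...
open OUT, ">demo_out.txt" or die "Cannot write demo_out.txt\n";
print OUT "line written in write mode\n";
close OUT;

# ... while ">>" keeps the existing content and adds text at the end.
open OUT, ">>demo_out.txt" or die "Cannot append to demo_out.txt\n";
print OUT "line added ", "in append mode", "\n";   # no comma after OUT
close OUT;

# Read the file back to check the result.
open IN, "demo_out.txt" or die "Cannot open demo_out.txt\n";
$one = <IN>;
$two = <IN>;
close IN;
print $one;
print $two;
```

Running the program a second time gives the same two lines again, because the first open uses ">" and therefore discards the previous content.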

Furthermore, a Perl program has access by default to the filehandles STDIN, STDOUT and STDERR, without having to open them. These names have their origin in the UNIX operating system: STDIN is the "standard input", usually what you type on the keyboard; STDOUT and STDERR are the "standard output" and the "standard error", and both are normally written to the screen.
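A small sketch of the two output filehandles, with invented message texts:

```perl
#!/usr/bin/perl
# print without a filehandle writes to STDOUT.
print "normal output, sent to STDOUT by default\n";
print STDOUT "normal output, with STDOUT named explicitly\n";
# Messages for the user are usually sent to STDERR instead.
print STDERR "a warning or error message, sent to STDERR\n";
```

If this is saved under a name of your choice, say std_demo.pl (a hypothetical name), and run as perl std_demo.pl > out.txt, the first two lines end up in out.txt while the third still appears on the screen, because > redirects only the standard output. In the same way, <STDIN> reads one line typed on the keyboard, just as <FILE> reads one line from a file.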

Now let's do something more serious. The file modelgenerator0.out contains output from Thomas Keane's ModelGenerator, which takes a multiple sequence alignment as input and tries to determine which model of base/amino acid evolution best fits the data. Write the following program, which parses the file and retrieves the model selected according to the Akaike Information Criterion 2 (AIC2).

#!/usr/bin/perl
open FILE, 'modelgenerator0.out';
while (<FILE>) {
  if (/Akaike Information Criterion 2 \(AIC2\)/) {
    $readmodel = 1;                      # we have reached the AIC2 section
  } elsif ($readmodel and /Model Selected/) {
    $_ =~ /Model Selected: (.+)/;        # capture the model name in $1
    print "$1\n";
    exit;                                # done: stop the program
  }
}
close FILE;

This program must repeat a procedure until some goal is achieved and therefore uses the syntax while (xxx) {yyy}, which executes the list of commands yyy over and over again as long as the condition xxx is true. A special feature of Perl is that if you write while (<XXX>) {yyy}, where XXX is the filehandle of a file opened for reading, the loop runs through the file for as long as there are lines left, each time putting the content of the current line into the special variable $_. Furthermore, if (/xxx/) is a shortcut for if ($_ =~ /xxx/). So, instead of:

while (<FILE>) {
  if (/Akaike Information Criterion 2 \(AIC2\)/)
  ...
  elsif ($readmodel and /Model Selected/)
    $_ =~ /Model Selected: (.+)/;

you could just as well have written:

while (<FILE>) {
  if ($_ =~ /Akaike Information Criterion 2 \(AIC2\)/)
     ...
  elsif ($readmodel and $_ =~ /Model Selected/)
    $_ =~ /Model Selected: (.+)/;

or:

while ($line = <FILE>) {
  if ($line =~ /Akaike Information Criterion 2 \(AIC2\)/)
  ...
  elsif ($readmodel and $line =~ /Model Selected/)
    $line =~ /Model Selected: (.+)/; 

Note also the logic of the program. The string Model Selected occurs as many as four times in the file, but we are only interested in the occurrence that comes just after AIC2. A classic programmer's trick is to use a variable (here $readmodel) as a flag that keeps track of where we are in the file: it stays false until the AIC2 header has been read, so earlier Model Selected lines are skipped.
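The flag technique can be tried in isolation with the sketch below. It works on a small made-up file, not on real ModelGenerator output; the file name demo_report.txt, the section names and the variable names are all invented for the illustration. It uses last, which leaves the loop, instead of exit, so that the file is still closed properly.

```perl
#!/usr/bin/perl
# Build a small made-up input file; it is NOT real ModelGenerator output.
open OUT, ">demo_report.txt" or die "Cannot create demo_report.txt\n";
print OUT "Section A\n";
print OUT "Result: apple\n";
print OUT "Section B\n";
print OUT "Result: banana\n";
print OUT "Section C\n";
print OUT "Result: cherry\n";
close OUT;

$insection = 0;   # flag: becomes 1 once the header of interest has been seen
$found = "";
open FILE, "demo_report.txt" or die "Cannot open demo_report.txt\n";
while (<FILE>) {
  if (/Section B/) {
    $insection = 1;
  } elsif ($insection and /Result: (.+)/) {
    $found = $1;    # text captured by the parentheses
    last;           # leave the loop: we have what we need
  }
}
close FILE;
print "$found\n";   # prints "banana"
```

The Result line of Section A is ignored because the flag is still 0 when it is read, and the one of Section C is never reached because the loop has already ended.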

You could try to improve the program by writing the selected model into a file. Do note that you will then have to perform two separate "open" operations in order to create two different filehandles, one to access modelgenerator0.out in read mode and another one to access whatever you call your output file in write mode.
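The pattern of two filehandles open at the same time, one for reading and one for writing, might look like the following sketch. It deliberately uses made-up data rather than modelgenerator0.out, so it is not a solution to the exercise as such; the file names demo_in.txt and demo_kept.txt and the filehandle names are assumptions made for this example.

```perl
#!/usr/bin/perl
# Made-up input so that the sketch runs on its own.
open TMP, ">demo_in.txt" or die "Cannot create demo_in.txt\n";
print TMP "keep: alpha\nskip: beta\nkeep: gamma\n";
close TMP;

open IN,  "demo_in.txt"    or die "Cannot open demo_in.txt\n";      # read mode
open OUT, ">demo_kept.txt" or die "Cannot create demo_kept.txt\n";  # write mode
$kept = 0;
while (<IN>) {
  if (/^keep: (.+)/) {
    print OUT "$1\n";   # goes to the file, not to the screen
    $kept = $kept + 1;
  }
}
close IN;
close OUT;
print "$kept lines written to demo_kept.txt\n";
```

Both filehandles stay open for the duration of the loop, and each is closed once the loop has finished.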
