Matching text with regular expressions

Perl makes it easy to parse and modify text with the help of so-called regular expressions. For a full description of the regular expression syntax, see page 14 of the "Quick Reference Guide" and the Perl books. Note that you can use parentheses '()' and the alternation operator '|' to write e.g. (embl|genbank|ddbj), which matches any of the three names. There are special characters that act as wildcards:

	.	matches anything except an end-of-line (\n)
	\d	matches any digit (same as [0-9])
	\D	matches any non-digit (same as [^0-9])
	\w	matches any word character (same as [0-9a-zA-Z_])
	\W	matches any non-word character (same as [^0-9a-zA-Z_])
	\s	matches any white space (for ASCII text, essentially [ \t\n\r\f])
	\S	matches any non-white space (for ASCII text, essentially [^ \t\n\r\f])
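
As a small illustration of alternation and of these shorthand classes, the sketch below tests two made-up strings. The sub name describe and the sample strings are assumptions chosen for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lists which of four sample patterns match $text.
sub describe {
    my ($text) = @_;
    my @found;
    push @found, 'database name' if $text =~ /(embl|genbank|ddbj)/;
    push @found, 'digit'         if $text =~ /\d/;
    push @found, 'white space'   if $text =~ /\s/;
    push @found, 'non-word'      if $text =~ /\W/;
    return join(', ', @found);
}

# All four patterns match: the space is white space and also a non-word character.
print describe('embl 42'), "\n";
# Only the first two match: '_' counts as a word character.
print describe('ddbj_7'), "\n";
```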

You can use ^ and $ for the beginning and the end of the string, respectively (not of a line within the string!); $ also matches just before a newline that ends the string. As you will have noticed, you can write your own sets of characters to be matched. Inside square brackets '[]', a leading ^ means "any character except the following ones" rather than beginning-of-string, and - written as the first character of the set means a literal hyphen, whereas elsewhere it indicates a range.
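
The following sketch shows anchors and hand-written character sets at work. The two sub names and the test strings are hypothetical and serve only as an illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $text consists only of the letters A, C, G and T.
# ^ and $ anchor the pattern to the beginning and the end of the string.
sub only_acgt {
    my ($text) = @_;
    return $text =~ /^[ACGT]+$/ ? 1 : 0;
}

# Hypothetical helper: true if $text contains a character that is neither
# a letter nor a hyphen. The '-' right after '^' is a literal hyphen,
# while a-z and A-Z are ranges.
sub has_odd_char {
    my ($text) = @_;
    return $text =~ /[^-a-zA-Z]/ ? 1 : 0;
}

for my $text ('GATTACA', 'GATXACA', 'AC-GT', 'AC GT') {
    printf "%-8s only_acgt=%d has_odd_char=%d\n",
        $text, only_acgt($text), has_odd_char($text);
}
```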

A character, or a string within parentheses, can be repeated a number of times:

	X{n,m}	X repeated between n and m times
	X*	X repeated any number of times, including 0
	X?	X present or not (same as X{0,1})
	X+	X present at least once (same as X{1,})
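
To see how the quantifiers differ, the sketch below tries five patterns on a few made-up strings and reports which ones match. The sub name matching_patterns and the sample strings are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the labels of the quantified patterns
# that match the whole of $text.
sub matching_patterns {
    my ($text) = @_;
    my @hits;
    push @hits, 'A{2,3}' if $text =~ /^GA{2,3}T$/;   # two or three A's
    push @hits, 'A*'     if $text =~ /^GA*T$/;       # any number of A's, including none
    push @hits, 'A?'     if $text =~ /^GA?T$/;       # zero or one A
    push @hits, 'A+'     if $text =~ /^GA+T$/;       # at least one A
    push @hits, '(GA)+'  if $text =~ /^(GA)+T$/;     # the string GA at least once
    return join(' ', @hits);
}

for my $text ('GT', 'GAT', 'GAAT', 'GAAAAT', 'GAGAT') {
    printf "%-7s %s\n", $text, matching_patterns($text);
}
```

Note that the last pattern repeats the whole parenthesised string GA, not just the letter A.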

A regular expression can be matched against a string using the "binding operator" '=~':

	$xxx =~ /yyy/	returns 'true' if regular expression yyy matches
                        somewhere in string xxx
	$xxx =~ /yyy/i	same, but the match is case-insensitive
	$xxx =~ s/yyy/zzz/	replaces the first part of string xxx that
                                matches regular expression yyy by string zzz
	$xxx =~ s/yyy/zzz/g	replaces all parts of string xxx that match
                                regular expression yyy by string zzz

Note that zzz can be an "interpreted string" (see my_first_Perl_program) and that the regular expression yyy can also contain variables, which are replaced by their values before matching. This allows for a very flexible syntax and makes Perl regular expressions a very powerful tool. If you want to match a character that has a special meaning in regular expressions (\ | ( ) [ ] { } ^ $ + ? * .), you have to "escape" it with a backslash, e.g. use \. to match only a literal dot instead of any character.
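
The sketch below illustrates the difference between s/// and s///g, the use of variables inside a pattern and its replacement, and the effect of escaping a dot. The sample text and all variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'seq.1 seq.2 seqX3';   # made-up sample text

# Without /g only the first match is replaced.
my $first = $text;
$first =~ s/seq/SEQ/;             # SEQ.1 seq.2 seqX3

# With /g every match is replaced.
my $all = $text;
$all =~ s/seq/SEQ/g;              # SEQ.1 SEQ.2 SEQX3

# Variables are replaced by their values in the pattern and in the replacement.
my $old = 'seq';
my $new = 'id';
my $renamed = $text;
$renamed =~ s/$old/$new/g;        # id.1 id.2 idX3

# An unescaped dot matches any character, so 'seqX' is replaced as well.
my $any_char = $text;
$any_char =~ s/seq./ID_/g;        # ID_1 ID_2 ID_3

# An escaped dot matches only a literal dot, so 'seqX3' is left alone.
my $literal_dot = $text;
$literal_dot =~ s/seq\./ID_/g;    # ID_1 ID_2 seqX3

print "$first\n$all\n$renamed\n$any_char\n$literal_dot\n";
```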

Now, let's use this by writing a program checkseq.pl that checks whether a string of characters is acceptable as a nucleic acid or protein sequence:

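#!/usr/bin/perl
# The sequence to check is the first command-line argument.
$seq = $ARGV[0];
if ($seq =~ /[JO]/) {
  # J and O are treated as illegal; $& holds the text that matched.
  print "is not a sequence, first illegal character is $&\n";
} elsif ($seq =~ /[EFILPQZ]/) {
  # These letters occur in protein sequences but not in nucleotide codes.
  print "is protein\n";
} else {
  print "is nucleic acid\n";
}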

Try it out with e.g. checkseq.pl XFTPO. This program must be able to make a choice and therefore uses the syntax if (xxx1) {yyy1} elsif (xxx2) {yyy2} elsif (xxx3) {yyy3} elsif (xxx4) {yyy4} ... else {yyy5}. If the expression xxx1 is true, the block of commands yyy1 is executed and the program skips all the rest and continues with the code after yyy5. If xxx1 is not true, the program executes the block of the first elsif whose expression is true; if none is true, it executes block yyy5 after the else.

Note the special variable $&: each time a pattern match succeeds, it is set to the part of the string that matched the regular expression. You can now use the program to check a few more sequences. The program checkseq.pl can be improved:

  • checkseq.pl only works properly when the sequence is all uppercase. Modify it to make it case-insensitive. The text above explains how.
  • What if the sequence contains non-letter characters? Modify the program so that it rejects any character that is not a letter as illegal.
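
One possible way to approach both improvements is sketched below; it is not the only solution. The sub name classify is an assumption, and the logic is wrapped in a sub only so that it can be tried on several strings in one run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classifies $seq in the same way as checkseq.pl,
# but ignores case and rejects every character that is not a letter.
sub classify {
    my ($seq) = @_;
    # The alternation is tried at each position from left to right,
    # so $& is the first illegal character in the string.
    if ($seq =~ /[JO]|[^A-Z]/i) {
        return "is not a sequence, first illegal character is $&";
    } elsif ($seq =~ /[EFILPQZ]/i) {
        return "is protein";
    } else {
        return "is nucleic acid";
    }
}

for my $seq ('xftpo', 'ACGT-A', 'MKLE', 'acgu') {
    print "$seq ", classify($seq), "\n";
}
```

Note that this sketch still reports an empty string as nucleic acid; an extra test would be needed to reject it.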

Test the following program, which converts a DNA sequence into the corresponding RNA sequence:

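#!/usr/bin/perl
$seq = $ARGV[0];
# Replace every T by U, once for uppercase and once for lowercase,
# so that the case of the input is preserved.
$seq =~ s/T/U/g;
$seq =~ s/t/u/g;
print "$seq\n";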

And the following, which searches for a potential prokaryotic ribosome binding site in RNA:

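#!/usr/bin/perl
$seq = $ARGV[0];
# Binding site, then 4 to 10 arbitrary bases, then the start codon AUG or GUG.
if ($seq =~ /(AGGA|GGAG|GAGG).{4,10}[AG]UG/) {
  # $1 holds the text matched by the first pair of parentheses.
  print "found $1\n";
}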

The idea is that a ribosome binding site consists of at least 4 consecutive bases of the sequence AGGAGG (which is complementary to the 16S rRNA) and is located 4 to 10 positions upstream of the start codon AUG or GUG. The string CGGAGCUCCGUGA should match. Note that the variable $1 is set to the part of the string that corresponds to (AGGA|GGAG|GAGG). In general, if a regular expression contains parentheses '()', the parts of the string that match what is between the parentheses are put in the variables $1, $2, $3, ..., whereby the opening parentheses '(' are counted from left to right.
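
To illustrate how the opening parentheses are counted, the sketch below adds two extra pairs of parentheses to the pattern. The sub name find_rbs and the extra groups are assumptions made for this example; they are not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the three captured parts, or an empty list
# if the pattern does not match.
#   $1 = binding site plus spacer (its '(' comes first)
#   $2 = binding site alone       (nested inside group 1)
#   $3 = start codon
sub find_rbs {
    my ($seq) = @_;
    if ($seq =~ /((AGGA|GGAG|GAGG).{4,10})([AG]UG)/) {
        return ($1, $2, $3);
    }
    return;
}

my @parts = find_rbs('CGGAGCUCCGUGA');
if (@parts) {
    print "site plus spacer: $parts[0]\n";   # GGAGCUCC
    print "site:             $parts[1]\n";   # GGAG
    print "start codon:      $parts[2]\n";   # GUG
}
```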
