Friday, June 20, 2008

Regular Expressions Part 1

I have been a great fan of using regular expressions for solving day to day tasks. I thought it was time to post a series for discussing the use and syntax of regular expressions. In this post i will talk about some simple meta-characters and their uses with tools like egrep/grep. In addition i will try and set some small exercises to practice the understanding of the readers.

A regular expression is composed of a set of meta-characters like *, ^ etc and a set of literal characters like A-Z, a-z, 0-9 etc.

Some Simple Meta-characters
Let us start with a some simple meta-characters and how we can apply these to various scenario's. The
  • ^ - The Caret/Anchor is one of the simplest meta-characters and represents the start of the line which is being checked. For example ^cat would match all lines which start with the literal cat like catalog
  • $ - The dollar meta-character represents the end of the line which is being checked. So cat$ would match all lines which end with the literal characters cat, for example scat.
These 2 meta-characters are unique in the sense that they match a position rather than the actual literal characters themselves.

Exercise -- What would ^cat$, ^$ and ^ match ?


  • Character Class -- The meta-character [...] called the character class lets you list the characters you want to allow at that point of match. For example e matches just e , a matches just a, however [ea] would match both e or a. Thus the implied meaning is that of OR. Suppose you wanted to check whether you have spelt grey or gray. The following regular expression would match the occurance of both gr[ea]y. This matches g follows by r followed by either e or a followed by y. The - operator within a character class matches a range, so [0-9A-Z] matches occurances of numbers and capital letters. For example ] would help you while searching headers in html documents. The - is special inside a character class, otherwise it is used as a normal character and not as a range.
  • Negated Character classes -- A ^ inside a character class negates the matches. For example [^a-z] would match any character which is not a through z. Remember a negated character class means match a character thats not listed and not don't match whats listed
Exercise -- Figure out all words in a dictionary which contain a q not followed by a u
  • Matching Any character with dot -- this is simpler to explain with an example. Suppose you want to find a date 26/02/1977, which can also be represented as 26-02-1977 or even 26.02.1977. Not the . character can be used to match any character. So seemingly, 26.02.1977 should return us the date however it be represented, maybe with a - or a /. However, the above would also match a random number like 12264023197734. This is because the dot character matches any character. The correct way to do so would be to use something like 26[-./]02[-./]1977 (Note, the - is this case would not be a meta-character, since it immediately follows the [, if the same was written as [.-/], then it would become the range meta-character, also the . is not treated as a meta-character since it is inside the character class)
  • Matching any one of several sub-expressions -- The | (OR) operator lets you combine expressions into a single expression, that matches any of the individual ones. For example, based on the previous examples, we can rewrite gr[ea]y as gr(e|a)y. (Note-- we cannot use something like gr[e|a]y as inside the character class, | is treated as a normal character). We need the parenthesis because gre|ay would mean gre or ay.
Exercises 1) How does gr[ea]y differ from gr(e|a)y ?
2) What is the difference between ^:List|Of|Books: from ^(List|Of|Books): ?
Note -- Tools like egrep provide switches for case-insensitive searches. For egrep -i performs a case insensitive search.
Example -- egrep -i '^(Subject|Predicate):' myfile.txt
This serves as a good groundwork for someone wanting to start using regular expressions. I will be following up the post with detailed topics. Till then, cheers and keep the faith.

The emperor and me beaching

The Devil next door

Kaiser The Emperor