Sunday, June 22, 2008

Regular Expressions Part 2

This entry continues the discussion regarding regular expressions from the previous post(Regular Expression part 1). We continue with a discussion of the rich meta-characters available. In this post i will take up some of the other meta-characters and discussion their usage. The pattern followed is the same as the previous post with some exercises thrown in for good measure.
  • Optional Items -- Suppose we wanted to search for the word June, which could be represented as either June or Jun. The ? meta-character means optional. So the search could be accomplished using June?. This can be interpreted as match J, then u then n followed by e if its there. So the ? is placed after the character that is allowed to appear at that point in the regular expression, but the existence isn't required to consider it a successful match. Expanding on the same lets say we have to match June 5th, which can occur as Jun 5, June 5th, or fifth. To match the following we can use (June|Jun).(5|5th|fifth). The first expression can be simplified as (June?) and the second one as fifth|5(th)?. Hence the complete expression can be re-written as June?.(fifth|5(th)?). One point here is that although there are different alternatives, choosing the right one needs some thought and introspection.
  • Repetition -- The + and * meta-characters are used for checking for repetition. + implies one or more of the preceding character and * implies any number including none of the item. Thus + fails if there are none of the characters but succeeds for one or many, * on the other hand always succeeds. As an example, to match spaces, one can use . ? which would match a single optional space, . + would match any number of spaces with at least one space and . * would match any number of spaces, if present. Lets take another example, where we have to search for the html tag &lt;hr size="10"&gt;. To build the regular expression, we need to have<, then HR followed by one or more spaces(allowed as per the semantics of html), followed by zero or more spaces, followed by = followed by zero or more spaces the number 10 , again zero or more spaces followed by >. Based on our discussion so far, this should be simple, Now lets try and generalise this to any size, not just 10, so the regular expression would look like For case insensitive search using egrep, we can use the -i option.
Exercise -- Write a regular expression for the same examplem as above, where the size attribute is optional e.i a simple
is a legal match.

  • Back References -- Many tools provide the ability for parentheses to remember text matched by the sub-expression they invoke. So as an example, ([a-z])([0-9])\1\2, \1 refers to the text matched by [a-z] and \2 refers text matched by [0-9]. We will see examples of using back-referencing in following posts.
  • Escapting Meta-characters -- In order to escape meta-characters, we can use \ to escape the meta-character. For example if we tried to match att.com, it could end up matching something like watt company. So we need to escape the meta-character . , this can be done using att\.com. This escapes the meta-character and treats the meta-character . like a normal character
This covers some of the most important meta-characters. In future posts i would expand on these fundamentals and show how applying regular expressions makes life simpler.

The emperor and me beaching

The Devil next door

Kaiser The Emperor