- Optional Items -- Suppose we wanted to search for the word June, which could be represented as either June or Jun. The ? meta-character means optional. So the search could be accomplished using June?. This can be interpreted as match J, then u then n followed by e if its there. So the ? is placed after the character that is allowed to appear at that point in the regular expression, but the existence isn't required to consider it a successful match. Expanding on the same lets say we have to match June 5th, which can occur as Jun 5, June 5th, or fifth. To match the following we can use (June|Jun).(5|5th|fifth). The first expression can be simplified as (June?) and the second one as fifth|5(th)?. Hence the complete expression can be re-written as June?.(fifth|5(th)?). One point here is that although there are different alternatives, choosing the right one needs some thought and introspection.
- Repetition -- The + and * meta-characters are used for checking for repetition. + implies one or more of the preceding character and * implies any number including none of the item. Thus + fails if there are none of the characters but succeeds for one or many, * on the other hand always succeeds. As an example, to match spaces, one can use . ? which would match a single optional space, . + would match any number of spaces with at least one space and . * would match any number of spaces, if present. Lets take another example, where we have to search for the html tag <hr size="10">. To build the regular expression, we need to have<, then HR followed by one or more spaces(allowed as per the semantics of html), followed by zero or more spaces, followed by = followed by zero or more spaces the number 10 , again zero or more spaces followed by >. Based on our discussion so far, this should be simple,
Now lets try and generalise this to any size, not just 10, so the regular expression would look like For case insensitive search using egrep, we can use the -i option.
is a legal match.
- Back References -- Many tools provide the ability for parentheses to remember text matched by the sub-expression they invoke. So as an example, ([a-z])([0-9])\1\2, \1 refers to the text matched by [a-z] and \2 refers text matched by [0-9]. We will see examples of using back-referencing in following posts.
- Escapting Meta-characters -- In order to escape meta-characters, we can use \ to escape the meta-character. For example if we tried to match att.com, it could end up matching something like watt company. So we need to escape the meta-character . , this can be done using att\.com. This escapes the meta-character and treats the meta-character . like a normal character