Cogito Ergo Sum: June 2008

Sunday, June 22, 2008

Regular Expressions Part 2

This entry continues the discussion regarding regular expressions from the previous post(Regular Expression part 1). We continue with a discussion of the rich meta-characters available. In this post i will take up some of the other meta-characters and discussion their usage. The pattern followed is the same as the previous post with some exercises thrown in for good measure.

Optional Items -- Suppose we wanted to search for the word June, which could be represented as either June or Jun. The ? meta-character means optional. So the search could be accomplished using June?. This can be interpreted as match J, then u then n followed by e if its there. So the ? is placed after the character that is allowed to appear at that point in the regular expression, but the existence isn't required to consider it a successful match. Expanding on the same lets say we have to match June 5th, which can occur as Jun 5, June 5th, or fifth. To match the following we can use (June|Jun).(5|5th|fifth). The first expression can be simplified as (June?) and the second one as fifth|5(th)?. Hence the complete expression can be re-written as June?.(fifth|5(th)?). One point here is that although there are different alternatives, choosing the right one needs some thought and introspection.
Repetition -- The + and * meta-characters are used for checking for repetition. + implies one or more of the preceding character and * implies any number including none of the item. Thus + fails if there are none of the characters but succeeds for one or many, * on the other hand always succeeds. As an example, to match spaces, one can use . ? which would match a single optional space, . + would match any number of spaces with at least one space and . * would match any number of spaces, if present. Lets take another example, where we have to search for the html tag <hr size="10">. To build the regular expression, we need to have<, then HR followed by one or more spaces(allowed as per the semantics of html), followed by zero or more spaces, followed by = followed by zero or more spaces the number 10 , again zero or more spaces followed by >. Based on our discussion so far, this should be simple, Now lets try and generalise this to any size, not just 10, so the regular expression would look like For case insensitive search using egrep, we can use the -i option.

Exercise -- Write a regular expression for the same examplem as above, where the size attribute is optional e.i a simple

is a legal match.

Back References -- Many tools provide the ability for parentheses to remember text matched by the sub-expression they invoke. So as an example, ([a-z])([0-9])\1\2, \1 refers to the text matched by [a-z] and \2 refers text matched by [0-9]. We will see examples of using back-referencing in following posts.
Escapting Meta-characters -- In order to escape meta-characters, we can use \ to escape the meta-character. For example if we tried to match att.com, it could end up matching something like watt company. So we need to escape the meta-character . , this can be done using att\.com. This escapes the meta-character and treats the meta-character . like a normal character

This covers some of the most important meta-characters. In future posts i would expand on these fundamentals and show how applying regular expressions makes life simpler.

Friday, June 20, 2008

Regular Expressions Part 1

I have been a great fan of using regular expressions for solving day to day tasks. I thought it was time to post a series for discussing the use and syntax of regular expressions. In this post i will talk about some simple meta-characters and their uses with tools like egrep/grep. In addition i will try and set some small exercises to practice the understanding of the readers.

A regular expression is composed of a set of meta-characters like *, ^ etc and a set of literal characters like A-Z, a-z, 0-9 etc.

Some Simple Meta-characters
Let us start with a some simple meta-characters and how we can apply these to various scenario's. The

^ - The Caret/Anchor is one of the simplest meta-characters and represents the start of the line which is being checked. For example ^cat would match all lines which start with the literal cat like catalog
$ - The dollar meta-character represents the end of the line which is being checked. So cat$ would match all lines which end with the literal characters cat, for example scat.

These 2 meta-characters are unique in the sense that they match a position rather than the actual literal characters themselves.

Exercise -- What would ^cat$, ^$ and ^ match ?

Character Class -- The meta-character [...] called the character class lets you list the characters you want to allow at that point of match. For example e matches just e , a matches just a, however [ea] would match both e or a. Thus the implied meaning is that of OR. Suppose you wanted to check whether you have spelt grey or gray. The following regular expression would match the occurance of both gr[ea]y. This matches g follows by r followed by either e or a followed by y. The - operator within a character class matches a range, so [0-9A-Z] matches occurances of numbers and capital letters. For example ] would help you while searching headers in html documents. The - is special inside a character class, otherwise it is used as a normal character and not as a range.
Negated Character classes -- A ^ inside a character class negates the matches. For example [^a-z] would match any character which is not a through z. Remember a negated character class means match a character thats not listed and not don't match whats listed

Exercise -- Figure out all words in a dictionary which contain a q not followed by a u

Matching Any character with dot -- this is simpler to explain with an example. Suppose you want to find a date 26/02/1977, which can also be represented as 26-02-1977 or even 26.02.1977. Not the . character can be used to match any character. So seemingly, 26.02.1977 should return us the date however it be represented, maybe with a - or a /. However, the above would also match a random number like 12264023197734. This is because the dot character matches any character. The correct way to do so would be to use something like 26[-./]02[-./]1977 (Note, the - is this case would not be a meta-character, since it immediately follows the [, if the same was written as [.-/], then it would become the range meta-character, also the . is not treated as a meta-character since it is inside the character class)
Matching any one of several sub-expressions -- The | (OR) operator lets you combine expressions into a single expression, that matches any of the individual ones. For example, based on the previous examples, we can rewrite gr[ea]y as gr(e|a)y. (Note-- we cannot use something like gr[e|a]y as inside the character class, | is treated as a normal character). We need the parenthesis because gre|ay would mean gre or ay.

Exercises 1) How does gr[ea]y differ from gr(e|a)y ?
2) What is the difference between ^:List|Of|Books: from ^(List|Of|Books): ?
Note -- Tools like egrep provide switches for case-insensitive searches. For egrep -i performs a case insensitive search.
Example -- egrep -i '^(Subject|Predicate):' myfile.txt
This serves as a good groundwork for someone wanting to start using regular expressions. I will be following up the post with detailed topics. Till then, cheers and keep the faith.

Thursday, June 19, 2008

The Art of analysis and problem solving

I have been wondering about the art of analysis and problem solving and decided to do a writeup using an example. A lot of developers tend to jump to writing code and never seem to bother about thinking out their solutions. For them if it works its fine. Moving away from this is a pre-requisite for anyone who wants to hone his development skills. In order to convey my point, i take up a seemingly simple example which displays how to analyze a solution before implementing it.

Problem Statement

Write a function which removes a set of characters from a given String. The signature of the function should be as follows

public String removeCharsFromString(String source, String remove)

Problem Analysis
The problem involves the following 2 steps
1) For each character in source, determine whether it needs to be deleted
2) Delete the character

Lets start with the actual deletion process first. You have to remove an element from an array. An array is a contiguous block of memory, so you cant remove an element from the middle as can be done with a linked list. So you have to rearrange the elements to maintain it as a contiguous block. For example if you need to remove c from an array containing abcd , you have to rearrange the elements. So you can either shift ab up or shift d down. In addition you need to decrease the size of the string by one.

Now how would the proposed algorithm work if you had to delete all elements from the source string. If the String is n characters long, this would mean shifting the last element by n-1, the second last by n-2 and so on, giving the worst time ass O(n^2).

How can we avoid this. What if you allocated a temporary string buffer and built your modified string there instead of in place? Then you could simply copy the characters you need to keep into the temporary string, skipping the characters you want to delete. When you finish building the modified string, you can copy it from the temporary buffer back into str. This way, you move each character at most twice, giving O(n) deletion. However, you’ve incurred the memory overhead of a temporary buffer the same size as the original string, and the time overhead of copying the modified string back over the original string. Is there any way you could avoid these penalties while retaining your O(n) algorithm?

To implement the O(n) algorithm just described, you need to track a source position for the read location in the original string and a destination position for the write position in the temporary buffer. These positions both start at zero. The source position is incremented every time you read, and the destination position is incremented every time you write. In other words, when you copy a character you increment both positions, but when you delete a character you increment only the source position. This means the source position will always be the same as or ahead of the destination position. Once you read a character from the original string (that is, the source position has advanced past it), you no longer need that character - in fact, you’re just going to copy the modified string over it. Because the destination position in the original string is always a character you don’t need anymore, you can write directly into the original string, eliminating the temporary buffer entirely. This is still an O(n) algorithm, but without the memory and time overhead of the earlier version.

Now to the actual part of deciding whether a character needs to be deleted. The easiest way is to compare all characters in the remove string with all the characters in the source string. If the size of the source string is n and the size of the remove string is m, this entails a running time of O(nm). How can we reduce this to a running time of O(m).

If we construct an array or a hashtable which has a constant time lookup, we can reduce the running time to O(m).
Lets construct an array containing all Booleans which is indexed by all the possible values of the characters in the source string. This enables you to determine whether a character is in remove by checking a single element.

Now the function will have 3 parts

Set all elements in the lookup array to false
Iterate through each element in remove, setting the corresponding value in the lookup array to true
Iterate through str with a source and destination index, copying each character only if its corresponding value in the lookup array is false

Now if we combine all the operations, the running time would be O(n+m) which is still linear.

Now lets take a look at the code

public String removeCharOccurances(String source,String remove){
char[] sourceArr = source.toCharArray();
char[] removeArr = remove.toCharArray();
boolean[] flags = new boolean[128]; // assumes ASCII!
int len = sourceArr.length;
int len1=removeArr.length;
int src, dst;

// Set flags for characters to be removed
for( src = 0; src < len1; ++src ){
flags[removeArr[src]] = true;
}
src = 0;
dst = 0;

// Now loop through all the characters,
// copying only if they aren't flagged
while( src < len ){
if( !flags[ (int)sourceArr[src] ] ){
sourceArr[dst++] = sourceArr[src];
}
++src;
}

return new String( sourceArr, 0, dst );
}

Wednesday, June 18, 2008

REST-ful services using JAX-WS 2.0

REST style services conform to a set of constraints and architectural principles

RESTful services are stateless. Each request from a client to a server must contain all the information necessary to understand the request and must not depend on any stored contextual information on the server.
REST-ful services have a uniform interface. For example for a RESTful HTTP service, the operation supported are GET, POST, PUT and DELETE.
REST based architectures are built from resources which are identified by unique URI's.
REST components manipulate resources by exchanging representations of the resources.

The fundamental differences between REST and RPC based services can be outlined as

RPC services work by invoking a procedure on a server rather than by exchanging representations of a resource.
In the RPC approach typically many operations are invoked at the same URI, which is fundamentally different from the REST way where every resource has a unique URI.

RESTful vs SOAP based services

Message Format

Interface Definition

Transport


XML	XML inside a SOAP envelope
none	WSDL
HTTP	HTTP,FTP,MIME,JMS,SMTP etc

In order to consume a RESFful service using JAX-WS 2.0, the majority of the work needs to be done using the javax.xml.ws.Dispatch interface. The javax.ws.xml.Service interface is the abstraction which provides the client interface to a web service. Since RESTful services do not have WSDL representations, the way to deal is a bit awkward, however that is the way JAX-WS has been designed. We use the Service interface to create an instance of the javax.xml.ws.Dispatch interface. Dispatch is a low level API which allows users to create messages by working directly with XML rather than using high level API's like JAXB etc.

The following code displays the basics and is discussed below

QName svcQName = new QName("http://xyz", "svc");
QName portQName = new QName("http://xyz", "port");
Service svc = Service.create(svcQName);
svc.addPort(portQName, HTTPBinding.HTTP_BINDING, args[0]);
Dispatch dis =
svc.createDispatch(portQName, Source.class, Service.Mode.PAYLOAD);
Map requestContext = dis.getRequestContext();
requestContext.put(MessageContext.HTTP_REQUEST_METHOD, "GET");
Source result = dis.invoke(null);
try {
TransformerFactory.newInstance().newTransformer()
.transform(result, new StreamResult(System.out));
} catch (Exception e) {
throw new IOException(e.getMessage());
}
The web service url is as a command line argument(arg[0])

The client uses the Service.addPort() method to create a port within the Service instance that can be used to access the RESTful Web service.
Next, the Service.createDispatch() method is invoked to create an instance of Dispatch—a Dispatch instance that enables you to work with XML request/response messages as instances of javax.xml.transform.Source.
The Dispatch.invoke() method then packages the XML request—per the JAX-WS 2.0 HTTP Binding—and sends it to the RESTful service. The invoke() method waits for the response before returning.
The service processes the HTTP GET and sends an HTTP response that includes the XML.
The invoke() method returns the response XML message as an instance of Source.

The addPort() method takes a URI parameter that defines the transport binding. The default is SOAP over HTTP but for RESTful services we use the HTTPBinding.HTTP_BINDING. The type of payload is specified in the createDispatch() method. Generally when passing SOAP messages, Service.Mode.MESSAGE is used to indicate that you want to work with the complete SOAP envelope as opposed to just the SOAP body(PAYLOAD). In case of RESTful services since there is no enveloper, PAYLOAD mode is used.

Thursday, June 12, 2008

Pair Programming

Kick-start the day by writing a unit test, just to get the grey cells working

Getting down to serious stuff.

This isnt making a lot of sense, is it.

Getting some tips from the senior partner

See that wasn't so difficult was it.

Tuesday, June 3, 2008

Light weight java game library

Being a server side guy, i decided it was time to do a bit of game programming as a challenge. I have started looking at LWJGL (Light Weight Java Game Library) which can interface with libraries like OpenGL and OpenAL. Still need to come up with a concept to start using it for developing a prototype, but time i did something different.

Cogito Ergo Sum