Coders Coffee Corner

Sunday, April 03, 2005

The intricacies of strstr

the deceptively simply sounding and oft-used function strstr has more than what meets the eye.

How many times is that you have searched a web-page/word document/filesystem for mactching strings? Probably more than once a day. The function that does string matching for you is strstr

This has the signature
char* strstr(const char* text , const char* pattern_tomatch)

It returns a pointer to the located string, else a null pointer.

Lets go about implementing this function.

The first and most trivial implementation has a O(n2) complexity


Heres the code for this

char* strstr(const char* s1 , const char* s2){
// set tmp pointers to intial location
char* tmp = s2;
char* tmp2 = s1;

// null and empty checks
if (0 == s2) return s1;
// returns s1 if s2 is ""
if (0 == *s2) return s1;
if (0 == s1) return 0;
if (0 == *s1) return 0;

while(*s1){
if (*s1 == *s2){
if (0 == *(++s2)){
return (s1-(s2-tmp-1));
}
}
else {
s2 = tmp;
s1 = ++tmp2;
if (*s1 == *s2) s2++;
}
s1++;
}
return 0;
}

In order to simplify the above "obfuscated" code and to prove its O(n2), lets rewrite using a for loop



char* strstr(const char* s1, const char* s2){

//skipping validations

// bad practise, size of s1 and/or s2 can be > long
unsigned long len_s1 = strlen(s1);
unsigned long len_s2 = strlen(s2);

unsigned long j = 0;
for (unsigned long i=0;i
if (s1[i] == s2[j]){
for (j=1;j if (s1[i+j] != s2[j])
break;
}
}
if (j == len_s2+1) return s1+i+(j-len_s2);
}

return (char*)0;
}



I always prefer not using ints/longs as they are always susceptible to overflows and lots of of (consequent) security bugs

I will now go about writing developing an algortihm for speeding up the whole process.

The most common thought that comes to the mind is to somehow use a hash table (hash table is logical to me because the problem involves a lookup of individual charaters)
Hash tables arent suitable for this problem because hashes dont store ordering information.
For eg,
aaxbaa , ab are to strstr'ed
a hash table would like
key | # occurences
a 4
b 1
x 1
the string to be matched "ab", will match the input str because both a and b exist in the hash table.

Obviously, this approach is no good. There should be a means to store the place of occurence of a character and not the number of occurences. So a hash table can look like this

key | linked list of positions
a 0->1->4->5
b 3
x 2

Now the algorithm can be modified to do look for s2 ("ab" in our case) such that the lookups are in increasing, consecutive order.

so, a look of "a" will yield 0 and b will yield 2. o and 2 are in consecutive, so its not a match.
Suppose, we look fo "xb", lookup for x yields 2 and lookup for b yields 3.
So (2,3) is an increasing, consecutive order and we have a match.

Now, what if s2 had many occurences of a character. In order to scale, the lookup should be engineered to take this into account.....
to be contd(with other possible heuristics) ......

0 Comments:

Post a Comment

<< Home