Coders Coffee Corner: The intricacies of strstr

the deceptively simply sounding and oft-used function strstr has more than what meets the eye.

How many times is that you have searched a web-page/word document/filesystem for mactching strings? Probably more than once a day. The function that does string matching for you is strstr

This has the signature
char* strstr(const char* text , const char* pattern_tomatch)

It returns a pointer to the located string, else a null pointer.

Lets go about implementing this function.

The first and most trivial implementation has a O(n2) complexity

Heres the code for this

char* strstr(const char* s1 , const char* s2){
// set tmp pointers to intial location
char* tmp = s2;
char* tmp2 = s1;

// null and empty checks
if (0 == s2) return s1;
// returns s1 if s2 is ""
if (0 == *s2) return s1;
if (0 == s1) return 0;
if (0 == *s1) return 0;

while(*s1){
if (*s1 == *s2){
if (0 == *(++s2)){
return (s1-(s2-tmp-1));
}
}
else {
s2 = tmp;
s1 = ++tmp2;
if (*s1 == *s2) s2++;
}
s1++;
}
return 0;
}

In order to simplify the above "obfuscated" code and to prove its O(n2), lets rewrite using a for loop

char* strstr(const char* s1, const char* s2){

//skipping validations

// bad practise, size of s1 and/or s2 can be > long
unsigned long len_s1 = strlen(s1);
unsigned long len_s2 = strlen(s2);

unsigned long j = 0;
for (unsigned long i=0;i
if (s1[i] == s2[j]){
for (j=1;j if (s1[i+j] != s2[j])
break;
}
}
if (j == len_s2+1) return s1+i+(j-len_s2);
}

return (char*)0;
}

I always prefer not using ints/longs as they are always susceptible to overflows and lots of of (consequent) security bugs

I will now go about writing developing an algortihm for speeding up the whole process.

The most common thought that comes to the mind is to somehow use a hash table (hash table is logical to me because the problem involves a lookup of individual charaters)
Hash tables arent suitable for this problem because hashes dont store ordering information.
For eg,
aaxbaa , ab are to strstr'ed
a hash table would like
key | # occurences
a 4
b 1
x 1
the string to be matched "ab", will match the input str because both a and b exist in the hash table.

Obviously, this approach is no good. There should be a means to store the place of occurence of a character and not the number of occurences. So a hash table can look like this

key | linked list of positions
a 0->1->4->5
b 3
x 2

Now the algorithm can be modified to do look for s2 ("ab" in our case) such that the lookups are in increasing, consecutive order.

so, a look of "a" will yield 0 and b will yield 2. o and 2 are in consecutive, so its not a match.
Suppose, we look fo "xb", lookup for x yields 2 and lookup for b yields 3.
So (2,3) is an increasing, consecutive order and we have a match.

Now, what if s2 had many occurences of a character. In order to scale, the lookup should be engineered to take this into account.....
to be contd(with other possible heuristics) ......

Coders Coffee Corner

Sunday, April 03, 2005

The intricacies of strstr

0 Comments:

About

About Me

Previous