R
Roy Smith
There's been a bunch of threads lately about string implementations, and
that got me thinking (which is often a dangerous thing).
Let's assume you're testing two strings for equality. You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.
I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning? If the strings are indeed equal, it's the
same amount of work starting from either end. But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.
It doesn't seem un-plausible that this is the case. For example, most
of the filenames I work with begin with "/home/roy/". Most of the
strings I use as memcache keys have one of a small number of prefixes.
Most of the strings I use as logger names have common leading
substrings. Things like credit card and telephone numbers tend to have
much more entropy in the trailing digits.
On the other hand, hostnames (and thus email addresses) exhibit the
opposite pattern.
Anyway, it's just a thought. Has anybody studied this for real-life
usage patterns?
I'm also not sure how this work with all the possible UCS/UTF encodings.
With some of them, you may get the encoding semantics wrong if you don't
start from the front.
that got me thinking (which is often a dangerous thing).
Let's assume you're testing two strings for equality. You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.
I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning? If the strings are indeed equal, it's the
same amount of work starting from either end. But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.
It doesn't seem un-plausible that this is the case. For example, most
of the filenames I work with begin with "/home/roy/". Most of the
strings I use as memcache keys have one of a small number of prefixes.
Most of the strings I use as logger names have common leading
substrings. Things like credit card and telephone numbers tend to have
much more entropy in the trailing digits.
On the other hand, hostnames (and thus email addresses) exhibit the
opposite pattern.
Anyway, it's just a thought. Has anybody studied this for real-life
usage patterns?
I'm also not sure how this work with all the possible UCS/UTF encodings.
With some of them, you may get the encoding semantics wrong if you don't
start from the front.