V
Veli-Pekka Tätilä
Hi,
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my short-time
presence.
The Problem:
Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of some
known alphabet (e.g. a to z). The task is to write a function that takes an
element E in S and returns all sub strings of E that uniquely identify E in
S. Put another way, all the returned strings are sub strings of E but they
may *not* be sub strings of any other element in S.
Some Thoughts:
Upon closer inspection, I also found out that the set of sub strings may
contain an element that is equal to E if E differs from some other string in
S by only one character. In that case, E as a whole is the only substring
identifying E.
Now, I must confess the problem is not inherently related to Perl in any way
thus the Semi OT tag in the subject-line. Perl would be the language of
choice for implementation personally, however, and thus some questions about
that, too.
There are bound to be better methods than the obvious brute-force tac of
trying everything. Howabout the string matching, is substr the way to go or
am I better off trying to hack regular expressions to do the job? If I'm
aiming at low temporal complexity, what would be the worst case complexity
for this problem? Surely not O(n). Performance is not an issue but I'm just
curious and not guru enough in math or computer science to figure it out
myself.
I'm certain the solution is relatively simple. I didn't find anything that
useful in the Perl Cookbook. The problem is not exactly a straight
intersection, difference etc... of two arrays but rather the set will
include elements that are not directly in the array being processed, though
they are based on an element of the array.
Motivation:
By the way, the thing that got me into pondering this problem in the first
place is somewhat related. I was writing a real crude mini-shell for a
textmode app of mine and in one command the user had to choose a very long
element (voice name) from a long list. To make the process easier, I allowed
selection based on a partial match provided that only one element contained
the sub string in question. Otherwise the matching elements would be listed.
Then I started to think if I could generate all the strings that select a
given element programatically, read lazily, and here's this post you're
reading.
PS: I had a hard time titling this post appropriately. Feel free to change
the subject to something more descriptive.
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my short-time
presence.
The Problem:
Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of some
known alphabet (e.g. a to z). The task is to write a function that takes an
element E in S and returns all sub strings of E that uniquely identify E in
S. Put another way, all the returned strings are sub strings of E but they
may *not* be sub strings of any other element in S.
Some Thoughts:
Upon closer inspection, I also found out that the set of sub strings may
contain an element that is equal to E if E differs from some other string in
S by only one character. In that case, E as a whole is the only substring
identifying E.
Now, I must confess the problem is not inherently related to Perl in any way
thus the Semi OT tag in the subject-line. Perl would be the language of
choice for implementation personally, however, and thus some questions about
that, too.
There are bound to be better methods than the obvious brute-force tac of
trying everything. Howabout the string matching, is substr the way to go or
am I better off trying to hack regular expressions to do the job? If I'm
aiming at low temporal complexity, what would be the worst case complexity
for this problem? Surely not O(n). Performance is not an issue but I'm just
curious and not guru enough in math or computer science to figure it out
myself.
I'm certain the solution is relatively simple. I didn't find anything that
useful in the Perl Cookbook. The problem is not exactly a straight
intersection, difference etc... of two arrays but rather the set will
include elements that are not directly in the array being processed, though
they are based on an element of the array.
Motivation:
By the way, the thing that got me into pondering this problem in the first
place is somewhat related. I was writing a real crude mini-shell for a
textmode app of mine and in one command the user had to choose a very long
element (voice name) from a long list. To make the process easier, I allowed
selection based on a partial match provided that only one element contained
the sub string in question. Otherwise the matching elements would be listed.
Then I started to think if I could generate all the strings that select a
given element programatically, read lazily, and here's this post you're
reading.
PS: I had a hard time titling this post appropriately. Feel free to change
the subject to something more descriptive.