Semi OT: Uniquely Identifying Substrings for an Elem in a Set: substr, Sets and Complexity

Veli-Pekka Tätilä · Aug 21, 2005

Hi,
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my short-time
presence.

The Problem:

Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of some
known alphabet (e.g. a to z). The task is to write a function that takes an
element E in S and returns all sub strings of E that uniquely identify E in
S. Put another way, all the returned strings are sub strings of E but they
may *not* be sub strings of any other element in S.

Some Thoughts:

Upon closer inspection, I also found out that the set of sub strings may
contain an element that is equal to E if E differs from some other string in
S by only one character. In that case, E as a whole is the only substring
identifying E.

Now, I must confess the problem is not inherently related to Perl in any way
thus the Semi OT tag in the subject-line. Perl would be the language of
choice for implementation personally, however, and thus some questions about
that, too.

There are bound to be better methods than the obvious brute-force tac of
trying everything. Howabout the string matching, is substr the way to go or
am I better off trying to hack regular expressions to do the job? If I'm
aiming at low temporal complexity, what would be the worst case complexity
for this problem? Surely not O(n). Performance is not an issue but I'm just
curious and not guru enough in math or computer science to figure it out
myself.

I'm certain the solution is relatively simple. I didn't find anything that
useful in the Perl Cookbook. The problem is not exactly a straight
intersection, difference etc... of two arrays but rather the set will
include elements that are not directly in the array being processed, though
they are based on an element of the array.

Motivation:

By the way, the thing that got me into pondering this problem in the first
place is somewhat related. I was writing a real crude mini-shell for a
textmode app of mine and in one command the user had to choose a very long
element (voice name) from a long list. To make the process easier, I allowed
selection based on a partial match provided that only one element contained
the sub string in question. Otherwise the matching elements would be listed.
Then I started to think if I could generate all the strings that select a
given element programatically, read lazily, and here's this post you're
reading.

PS: I had a hard time titling this post appropriately. Feel free to change
the subject to something more descriptive.

Veli-Pekka Tätilä · Aug 21, 2005

Hi again,
First mistake spotted. When talking about the perl functions I naturally
ment index rather than substr. I think this does come down to my English
skills and the semantics. Sub string means any part of a larger original
string. yet the substr function won't tell if a given substring is found but
rather extracts a substring based on character indeces. Maybe the names
cutString, piece or contains would be more descriptive. Well, my bad as
usual.

PS: Please reply to my original message in most cases. If someone wants to
discuss Perl's naming further, do at least change the subject.

With kind regards Veli-Pekka Tätilä ([email protected])
Accessibility, game music, synthesizers and programming:
http://www.student.oulu.fi/~vtatila/

xhoster · Aug 21, 2005

Veli-Pekka Tätilä said:
Hi,
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my
short-time presence.

The Problem:

Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of
some known alphabet (e.g. a to z). The task is to write a function that
takes an element E in S and returns all sub strings of E that uniquely
identify E in S. Put another way, all the returned strings are sub
strings of E but they may *not* be sub strings of any other element in S.

This was discussed here two months ago, and someone claimed Text::Ngrams
was the/a solution. I don't whether it actually is or not, but you may
want to look up the original thread.

Upon closer inspection, I also found out that the set of sub strings may
contain an element that is equal to E if E differs from some other string
in S by only one character. In that case, E as a whole is the only
substring identifying E.

What if E itself is a substring of some other element of S? Then not even
E as a whole identifies E!

Xho

Tassilo v. Parseval · Aug 21, 2005

Also sprach Veli-Pekka Tätilä:

Hi,
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my short-time
presence.

The Problem:

Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of some
known alphabet (e.g. a to z). The task is to write a function that takes an
element E in S and returns all sub strings of E that uniquely identify E in
S. Put another way, all the returned strings are sub strings of E but they
may *not* be sub strings of any other element in S.

Some Thoughts:

Upon closer inspection, I also found out that the set of sub strings may
contain an element that is equal to E if E differs from some other string in
S by only one character. In that case, E as a whole is the only substring
identifying E.

Now, I must confess the problem is not inherently related to Perl in any way
thus the Semi OT tag in the subject-line. Perl would be the language of
choice for implementation personally, however, and thus some questions about
that, too.

There are bound to be better methods than the obvious brute-force tac of
trying everything. Howabout the string matching, is substr the way to go or
am I better off trying to hack regular expressions to do the job? If I'm
aiming at low temporal complexity, what would be the worst case complexity
for this problem? Surely not O(n). Performance is not an issue but I'm just
curious and not guru enough in math or computer science to figure it out
myself.

With a set of $n words, I would assume the complexity of a brute-force
approach to be

O( $n * ($n-1) * ($avg * ($avg-1) / 2) )

where $avg is the average word length, the term ($avg * ($avg-1) / 2)
being the number of substrings for an average word and ($n-1) the number
of words to test each substring against. As the average number of
substrings doesn't depend on the size of the set, it can be assumed to
be some constant in which case the algorithm is in O($n**2) for all
words and subsequently O($n) for retrieving the unique substrings for
one word.

Note that the question whether to use a pattern match or index() or
whatever doesn't really change the algorithm. In the one below I used
index() but could have done it just as easily using a pattern match.

Here's an implementation of such a brute-force algorithm which gave me
the opportunity to make use of Eric J. Roode's Iterator module recently
uploaded to the CPAN:

#!/usr/bin/perl -w

use strict;
use Iterator;
use List::MoreUtils qw/all/;

chomp( my @words = <DATA> );

for (@words) {
print "Unique ids for $_:\n ";
print join ", ", uniq_id($_, \@words);
print "\n";
}

sub uniq_id {
my ($word, $set) = @_;
my $iter = do {
my ($len, $off) = (1, 0);
my $i = Iterator->new(
sub {
Iterator::is_done() if $len > length $word;
my $sub = substr $word, $off++, $len;
($len, $off) = ($len + 1, 0)
if $off + $len > length $word;
return $sub;
}
);
$i;
};
my @ids;
while ($iter->isnt_exhausted) {
my $pat = $iter->value;
push @ids, $pat
if all { index($_, $pat) == -1 } grep { $_ ne $word } @$set;
}
return @ids;
}

__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl

One optimization immediately springs to my mind (unfortunately after
I've written the above):

The iterator returns the substrings in ascending order length-wise.
However, it's always so that if there is no unique substring of length
$m, there cannot possibly be one of $m-1. Therefore, have the iterator
return the substrings in descending order and rewrite the 'all { } grep
{ } @$set' condition into a real loop that exits early as soon as none
of the substrings of a given length are unique.

This does not really change the complexity of the algorithm as it only
makes the constant ($avg * ($avg-1) / 2) smaller. For small sets though
this constant is the dominant parameter.

Motivation:

By the way, the thing that got me into pondering this problem in the first
place is somewhat related. I was writing a real crude mini-shell for a
textmode app of mine and in one command the user had to choose a very long
element (voice name) from a long list. To make the process easier, I allowed
selection based on a partial match provided that only one element contained
the sub string in question. Otherwise the matching elements would be listed.
Then I started to think if I could generate all the strings that select a
given element programatically, read lazily, and here's this post you're
reading.

Something like TAB-completion offered by many shells? That would
decrease the complexity somewhat as you'd only have to test substrings
with offset zero (that is, at the beginning of the word).

PS: I had a hard time titling this post appropriately. Feel free to change
the subject to something more descriptive.

Your Subject was fine. You can't normally squeeze the whole content of a
posting into the Subject. If that was possible, postings wouldn't need a
body and we'd all communicate here by exchanging Subjects.

Tassilo

Anno Siegel · Aug 22, 2005

Veli-Pekka Tätilä said:
Hi,
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my short-time
presence.

The Problem:

Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of some
known alphabet (e.g. a to z). The task is to write a function that takes an
element E in S and returns all sub strings of E that uniquely identify E in
S. Put another way, all the returned strings are sub strings of E but they
may *not* be sub strings of any other element in S.

Here is a brute-force solution. Generate all substrings of all strings
in the set and remember the string each substring was generated from.
(That happens in the hash %coll.) Then the substrings that were generated
from exactly one string are the ones you're looking for.

If you have to find solutions to your problem repeatedly (with the same
basic set of strings), my solution may even be practical. Its main
virtue is that the algortithm is so simple, it is almost certainly correct.

With a bow to Tassilo for providing a good example set:

#!/usr/bin/perl
use strict; use warnings; $| = 1;

my %coll;
for my $str ( <DATA> ) {
chomp $str;
push @{ $coll{ $_}}, $str for substrings( $str);
}

# postprocessing for nice output
my %abbrev;
@{ $coll{ $_}} == 1 and push @{ $abbrev{ $coll{ $_}->[ 0]}}, $_ for
keys %coll;

print "$_ -> @{ $abbrev{ $_}}\n" for sort keys %abbrev;

#################################################################

sub substrings {
my $str = shift;
map substr_n( $str, $_), 0 .. length $str;
}

sub substr_n {
my ($str, $l) = @_;
map substr( $str, $_, $l), 0 .. length( $str) - $l;
}

__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl

Anno

xhoster · Aug 22, 2005

Here is a brute-force solution. Generate all substrings of all strings
in the set and remember the string each substring was generated from.
(That happens in the hash %coll.) Then the substrings that were
generated from exactly one string are the ones you're looking for.

If you have to find solutions to your problem repeatedly (with the same
basic set of strings), my solution may even be practical. Its main
virtue is that the algortithm is so simple, it is almost certainly
correct.

If you add 'mississippi' to the list, it misses 'iss' as a identifying
substring (and I suppose others), due to the internal duplication.

With a bow to Tassilo for providing a good example set:

#!/usr/bin/perl
use strict; use warnings; $| = 1;

my %coll;
for my $str ( <DATA> ) {
chomp $str;
push @{ $coll{ $_}}, $str for substrings( $str);
}

You should either use a hash of hash rather than a hash of array, (which
seems to me like it would be a more natural anyway) or change substrings so
that it only returns one copy of each substring for any given invocation.

Xho

Anno Siegel · Aug 23, 2005

If you add 'mississippi' to the list, it misses 'iss' as a identifying
substring (and I suppose others), due to the internal duplication.

Oh dear! Good thing I said "almost certainly".

[Advice snipped but incorporated below]

Anno

#!/usr/bin/perl
use strict; use warnings; $| = 1;

my %coll;
for my $str ( <DATA> ) {
chomp $str;
$coll{ $_}->{ $str} = 1 for substrings( $str);
}

# postprocessing for nice output
my %abbrev;

for ( keys %coll ) {
if ( ( my ( $str) = keys %{ $coll{ $_}}) == 1 ) {
push @{ $abbrev{ $str}}, $_;
}
}

@$_ = sort { length( $a) <=> length( $b) || $a cmp $b } @$_ for values %abbrev;

print "$_ -> @{ $abbrev{ $_}}\n" for sort keys %abbrev;

#################################################################

sub substrings {
my $str = shift;
map substr_n( $str, $_), 0 .. length $str;
}

sub substr_n {
my ($str, $l) = @_;
map substr( $str, $_, $l), 0 .. length( $str) - $l;
}

__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl
mississippi

Creating books with Sets	1	Sep 27, 2022
Time Complexity for substr() function	1	May 13, 2004
Set value after clicking in field	11	May 9, 2023
How Do I Set text on an Image and use the image as a border?	7	Mar 16, 2023
Converting an Array to a String in JavaScript	7	Sep 22, 2023
[semi OT]: Smartphones and Python?	23	Feb 15, 2012
SUBSTR() with replacement or lvalue performance issues	22	Nov 7, 2008
The distinction between a java applet and an application	1	Jan 4, 2023

Semi OT: Uniquely Identifying Substrings for an Elem in a Set: substr, Sets and Complexity

Veli-Pekka Tätilä

Veli-Pekka Tätilä

xhoster

Tassilo v. Parseval

Anno Siegel

xhoster

Anno Siegel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads