Fastest way to remove common substrings from a list of strings

F

freesoft12

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;
for (my $j = $index; $j < $#list; ++$j) {
if ($current_item =~ m/$list[$j]/) { # is the current item a
substring of another
push (@to_be_removed_indices, $index); # schedule its removal
last; # break
}
}
# Remove matched items

Regards
John
 
M

Mirco Wahab

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

Still pseudo O(2), but much shorter:

...
foreach my $k (@list) {
$k = undef if grep defined && /^$k./, @list;
}

print join "\n", grep defined, @list;
...

Regards

M.
 
D

Dave B

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;
for (my $j = $index; $j < $#list; ++$j) {
if ($current_item =~ m/$list[$j]/) { # is the current item a
substring of another
push (@to_be_removed_indices, $index); # schedule its removal
last; # break
}
}
# Remove matched items

If your condition is "remove a string if it's a substring of another,
matching *at the beginning*", then the following code seems to work:

my @list = sort qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp );

my $s=$list[0];
my @newlist;

for (@list, substr($list[-1],1)) {
push(@newlist,$s) unless /$s/;
$s=$_;
}

print "@newlist\n";

# now @newlist has the values that are not
# substring of anything else

I hope I understood the problem correctly.
 
T

Ted Zlatanov

fc> Given a list containing various UNIX directory paths:

fc> @list = qw(
fc> /home/bart
fc> /home/bart/foo
fc> /usr/lib/
fc> /home/bart/foo/xyz
fc> /home/bart/foo/xyz/live.cpp )

fc> I want to trim this list to contain:

fc> @list = qw(
fc> /usr/lib/
fc> /home/bart/foo/xyz/live.cpp )

fc> Is there a better way than this possibly n^2 algorithm below.

fc> my @to_be_removed_indices = ();
fc> for (my $index=0; $index < #list; ++$index) {
fc> my $current_item = $list[$index];
fc> ++$index;
fc> for (my $j = $index; $j < $#list; ++$j) {
fc> if ($current_item =~ m/$list[$j]/) { # is the current item a
fc> substring of another
fc> push (@to_be_removed_indices, $index); # schedule its removal
fc> last; # break
fc> }
fc> }
fc> # Remove matched items

If @to_be_removed_indices has less than 5 elements, don't worry about
it, your complexity is too low to matter. Even 10 elements won't make a
big difference. Complexity measurements matter when you get into
thousands and millions of items ($disclaimers_go_here).

Otherwise, you can do many optimizations. First one I see is that you
should use index() instead of a match if you want to check for
substrings. Another is to find the ASCII characters not in
@to_be_removed_indices, let's say they are qw/q w p/, and then exclude
any string that doesn't have those characters. You can also do a trie
of your search strings, or build an regex alternation of them and hope
the regex engine can optimize it for you. The general idea is to
pre-process as much as possible, and let Perl use index() instead of
regular expressions when possible.

Ted
 
M

Mirco Wahab

Dave said:
This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

You won't need all that if you start from an already *sorted* list,
a single map will do:

my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

Regards

M.
 
D

Dave B

Mirco said:
This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

You won't need all that if you start from an already *sorted* list,
a single map will do:

my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

I agree that using map is more Perl-style, but the above statement alone
doesn't work. You need at least this:

my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
push @result, $list[-1];
 
X

xhoster

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )


That isn't a complete enough specification. What if there is a
/foo/home/bart? How about a /home/bartzilla?

Is there a better way than this possibly n^2 algorithm below.

Almost anything would be better than the below. It doesn't compile.
And once #list is turned to $#list, it does compile but leaves
@to_be_removed_indices empty, so it doesn't what you say you want done.
If you are happy with a wrong answer, then there are much better ways than
n^2 to achieve that.
my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;

You are incrementing ++$index twice, so only even numbered indexes are
being "processed".

etc.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
M

Mirco Wahab

Dave said:
Mirco said:
my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

I agree that using map is more Perl-style, but the above statement alone
doesn't work. You need at least this:

my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
push @result, $list[-1];

Yes, you are right. I saw the mistake and corrected it (tried to):

my @result =
(
(map { $list[$_+1] =~ /^\Q$list[$_]\E./ ? () : $list[$_] } 0..$#list-1),
$list[-1]
);

but you have been faster. The last element was missing.

Thanks & Regards

M.
 
D

Dave B

Leon said:
You should really quotemeta $s or else weird things can happen if it
contains regexp meta characters. Better yet I would write that line as:

push(@newlist, $s) if substr($_, 0, length $s) ne $s.

Since that doesn't waste time looking for a substring if it isn't at the
start.

I fixed both issues in a subsequent post.
 
C

comp.llang.perl.moderated

This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;


here's a shorter (and tad more obfuscated)
variant:

my @list = sort qw( ... );
my ($s, @newlist);
for (@list, undef) {
push(@newlist,$s) unless /\Q$s\E/;
$s=$_;
}
 
S

szr

Leon said:
You should really quotemeta $s or else weird things can happen if it
contains regexp meta characters. Better yet I would write that line
as:

push(@newlist, $s) if substr($_, 0, length $s) ne $s.

Since that doesn't waste time looking for a substring if it isn't at
the start.

Or he just change that regex from /$s/ to /^$s/ to also achor it at the
beginning. I think the substr might be ever so slightly faster though.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,185
Members
46,737
Latest member
Georgeengab

Latest Threads

Top