Fastest way to remove common substrings from a list of strings

freesoft12 · Jul 15, 2008

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;
for (my $j = $index; $j < $#list; ++$j) {
if ($current_item =~ m/$list[$j]/) { # is the current item a
substring of another
push (@to_be_removed_indices, $index); # schedule its removal
last; # break
}
}
# Remove matched items

Regards
John

Mirco Wahab · Jul 15, 2008

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

Still pseudo O(2), but much shorter:

...
foreach my $k (@list) {
$k = undef if grep defined && /^$k./, @list;
}

print join "\n", grep defined, @list;
...

Regards

M.

Dave B · Jul 15, 2008

Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;
for (my $j = $index; $j < $#list; ++$j) {
if ($current_item =~ m/$list[$j]/) { # is the current item a
substring of another
push (@to_be_removed_indices, $index); # schedule its removal
last; # break
}
}
# Remove matched items

If your condition is "remove a string if it's a substring of another,
matching *at the beginning*", then the following code seems to work:

my @list = sort qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp );

my $s=$list[0];
my @newlist;

for (@list, substr($list[-1],1)) {
push(@newlist,$s) unless /$s/;
$s=$_;
}

print "@newlist\n";

# now @newlist has the values that are not
# substring of anything else

I hope I understood the problem correctly.

Dave B · Jul 15, 2008

Dave said:
push(@newlist,$s) unless /$s/;

This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

Ted Zlatanov · Jul 15, 2008

fc> Given a list containing various UNIX directory paths:

fc> @list = qw(
fc> /home/bart
fc> /home/bart/foo
fc> /usr/lib/
fc> /home/bart/foo/xyz
fc> /home/bart/foo/xyz/live.cpp )

fc> I want to trim this list to contain:

fc> @list = qw(
fc> /usr/lib/
fc> /home/bart/foo/xyz/live.cpp )

fc> Is there a better way than this possibly n^2 algorithm below.

fc> my @to_be_removed_indices = ();
fc> for (my $index=0; $index < #list; ++$index) {
fc> my $current_item = $list[$index];
fc> ++$index;
fc> for (my $j = $index; $j < $#list; ++$j) {
fc> if ($current_item =~ m/$list[$j]/) { # is the current item a
fc> substring of another
fc> push (@to_be_removed_indices, $index); # schedule its removal
fc> last; # break
fc> }
fc> }
fc> # Remove matched items

If @to_be_removed_indices has less than 5 elements, don't worry about
it, your complexity is too low to matter. Even 10 elements won't make a
big difference. Complexity measurements matter when you get into
thousands and millions of items ($disclaimers_go_here).

Otherwise, you can do many optimizations. First one I see is that you
should use index() instead of a match if you want to check for
substrings. Another is to find the ASCII characters not in
@to_be_removed_indices, let's say they are qw/q w p/, and then exclude
any string that doesn't have those characters. You can also do a trie
of your search strings, or build an regex alternation of them and hope
the regex engine can optimize it for you. The general idea is to
pre-process as much as possible, and let Perl use index() instead of
regular expressions when possible.

Ted

Mirco Wahab · Jul 15, 2008

Dave said:
This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

You won't need all that if you start from an already *sorted* list,
a single map will do:

my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

Regards

M.

Dave B · Jul 15, 2008

Mirco said:
This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

Click to expand...

You won't need all that if you start from an already *sorted* list,
a single map will do:

my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

I agree that using map is more Perl-style, but the above statement alone
doesn't work. You need at least this:

my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
push @result, $list[-1];

xhoster · Jul 15, 2008

[email protected] said:
Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

That isn't a complete enough specification. What if there is a
/foo/home/bart? How about a /home/bartzilla?

Is there a better way than this possibly n^2 algorithm below.

Almost anything would be better than the below. It doesn't compile.
And once #list is turned to $#list, it does compile but leaves
@to_be_removed_indices empty, so it doesn't what you say you want done.
If you are happy with a wrong answer, then there are much better ways than
n^2 to achieve that.

my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;

You are incrementing ++$index twice, so only even numbered indexes are
being "processed".

etc.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Mirco Wahab · Jul 15, 2008

Dave said:
Mirco said:

my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

Click to expand...

I agree that using map is more Perl-style, but the above statement alone
doesn't work. You need at least this:

my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
push @result, $list[-1];

Yes, you are right. I saw the mistake and corrected it (tried to):

my @result =
(
(map { $list[$_+1] =~ /^\Q$list[$_]\E./ ? () : $list[$_] } 0..$#list-1),
$list[-1]
);

but you have been faster. The last element was missing.

Thanks & Regards

M.

Dave B · Jul 15, 2008

Leon said:
You should really quotemeta $s or else weird things can happen if it
contains regexp meta characters. Better yet I would write that line as:

push(@newlist, $s) if substr($_, 0, length $s) ne $s.

Since that doesn't waste time looking for a substring if it isn't at the
start.

I fixed both issues in a subsequent post.

comp.llang.perl.moderated · Jul 15, 2008

This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

here's a shorter (and tad more obfuscated)
variant:

my @list = sort qw( ... );
my ($s, @newlist);
for (@list, undef) {
push(@newlist,$s) unless /\Q$s\E/;
$s=$_;
}

szr · Jul 15, 2008

Leon said:
You should really quotemeta $s or else weird things can happen if it
contains regexp meta characters. Better yet I would write that line
as:

push(@newlist, $s) if substr($_, 0, length $s) ne $s.

Since that doesn't waste time looking for a substring if it isn't at
the start.

Or he just change that regex from /$s/ to /^$s/ to also achor it at the
beginning. I think the substr might be ever so slightly faster though.

freesoft12 · Jul 16, 2008

thanks for your help! I will try out your suggestions!

Regards
John

Fastest way to detect a non-ASCII character in a list of strings.	2	Oct 17, 2010
Better way to replace/remove characters in a list of strings.	7	Sep 4, 2006
Extracting/finding strings from a list	2	Jul 19, 2004
need help with a cart I inherited, need to increase number of total characters allowed	3	Oct 22, 2007
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004
comp.lang.vhdl FAQ part 1 of 4: general	0	Jul 8, 2003
comp.lang.c Changes to Answers to Frequently Asked Questions (FAQ)	1	Jul 4, 2004

Fastest way to remove common substrings from a list of strings

freesoft12

Mirco Wahab

Dave B

Dave B

Ted Zlatanov

Mirco Wahab

Dave B

xhoster

Mirco Wahab

Dave B

comp.llang.perl.moderated

szr

freesoft12

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads