How to speed up this slow part of my program


Justin C

We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.

In a program where I have to get new images from the art guy's computer
I end up grepping the entire list of items $#(list-of-items) times;
there must be a better way. The file names are exactly the same as the
style codes, apart from the size suffix being dropped. I'm using
File::Find.

Here's some code:

find(\&set_flag, (keys %{ $stock_groups->{text2code} }));

sub set_flag {
    return unless (-f $_);

    (my $item_code_part = $_) =~ s/\.jpg//;
    $item_code_part = uc($item_code_part);
    $item_code_part =~ s|_|/|g;

    my @matches = grep(/$item_code_part/, keys %{ $stock_items });
    foreach my $i (@matches) {
        $stock_items->{$i}{got_image} = 1;
    }
}

The 'find' iterates through two top-level directories, 36 next-level
directories in each of the top-level ones, and about 20k files across
all 72 level-2 directories. It then passes the filename to the sub,
which compares the filename (only a partial stock code, so it may have
several matches) with the hash of stock_items, pulling out matches.
Those matches are then iterated over, and the items with an available
image get a hash element added and set to 1.

I can probably do the grep and the iteration over the matches with
map{...}, grep{...}, keys %{ $stock_items }; but I don't think that'll
save much time, and I'm not certain how to do it. I can see how to
create a new hash, but I'm not sure whether changing the hash while
grep iterates through it is a good idea.

The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

Justin.
 

Rainer Weikusat

[...]
find(\&set_flag, (keys %{ $stock_groups->{text2code} }));

sub set_flag {
return unless (-f $_ );

(my $item_code_part = $_) =~ s/\.jpg//;
$item_code_part = uc($item_code_part);
$item_code_part =~ s|_|/|g;

my @matches = grep(/$item_code_part/, keys %{ $stock_items });
foreach my $i (@matches) {
$stock_items->{$i}{got_image} = 1;
}
}

The 'find' iterates through two top-level directories, 36 next-level
directories in each of the top-level ones, and about 20k files across
all 72 level-2 directories. It then passes the filename to the sub,
which compares the filename (only a partial stock code, so it may have
several matches) with the hash of stock_items, pulling out matches.
Those matches are then iterated over, and the items with an available
image get a hash element added and set to 1.
[...]

The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

"Doing linear scans over an associative array is like trying to club
someone to death with a loaded Uzi."

An obvious suggestion would be to traverse the stock_items keys once
and build a second hash which maps each item_code_part to an array
reference containing all the corresponding stock_items keys (or even
to a reference to the corresponding stock_items element) and use this
second hash to locate the keys which need to be changed (or the hash
locations which need to be changed) in constant time.
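
A rough sketch of building such a hash once up front (the rule for
deriving the item code part from a stock key is an assumption here and
has to match however the codes are actually structured):

# One pass over all stock keys, building:
#   item_code_part => [ all stock keys sharing that code part ]
# ASSUMPTION: a stock key looks like "STYLE/COLOUR/SIZE" and the image
# name corresponds to the key with its final "/SIZE" segment removed.
my %items_by_code;
for my $key (keys %{ $stock_items }) {
    (my $code = $key) =~ s{/[^/]+\z}{};   # drop the trailing size segment
    push @{ $items_by_code{$code} }, $key;
}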

NB: Someone should probably repost that, since 'Justin who keeps coming
out of my killfile' has chosen to 'block' himself from seeing my
answers to his questions for 'clearly indecent behaviour' ...
 

Tim McDaniel

Ben Morrow said:
I would probably turn this into a big pattern match. Something like
this:

use File::Find::Rule;

my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
File::Find::Rule->file->in(keys %{...});

while (my ($item, $entry) = each %$stock_items) {
$item =~ $imgs and $entry->{got_image} = 1;
}

If you're using 5.14 you can get rid of the ugly map block using s///r
and tr///r:

map tr!_!/!r, map s/\.jpg//r,

This assumes the entries in %$stock_items are already hashrefs; if they
aren't a 'for (keys %$stock_items)' loop will be easier.
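
(A minimal illustration of what the /r flag from 5.14 does: it returns
the modified copy and leaves the original string alone; the file name
and variable names here are made up.)

use v5.14;
my $file = 'shirt_red.jpg';            # made-up example name
my $stem = $file =~ s/\.jpg\z//r;      # "shirt_red"; $file is unchanged
my $code = uc( $stem =~ tr!_!/!r );    # "SHIRT/RED"
say $code;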

What's s///r and tr///r? I looked in "man perlop" and "man perlre"
on a system with 5.14.2, but I didn't see them.
 

Rainer Weikusat

Ben Morrow said:
Quoth Justin C <[email protected]>:
[...]
find(\&set_flag, (keys %{ $stock_groups->{text2code} }));

sub set_flag {
return unless (-f $_ );

(my $item_code_part = $_) =~ s/\.jpg//;
$item_code_part = uc($item_code_part);
$item_code_part =~ s|_|/|g;

my @matches = grep(/$item_code_part/, keys %{ $stock_items });

Careful: you want \Q there, even if you think you're sure the filenames
are all safe.
foreach my $i (@matches) {
$stock_items->{$i}{got_image} = 1;
}
}

I would probably turn this into a big pattern match. Something like
this:

use File::Find::Rule;

my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
File::Find::Rule->file->in(keys %{...});

while (my ($item, $entry) = each %$stock_items) {
$item =~ $imgs and $entry->{got_image} = 1;
}

Something I should already have written last time: A hash lookup is
(supposed to be) an O(1) operation. Matching against a set of
alternative patterns is O(n). In this particular case, this means that
the algorithm you suggest still scales as badly as the originally
used one in this respect (run time proportional to the number of
patterns times the number of stock items), except that it is possibly
somewhat faster (although I wouldn't want to bet on that blindly).

The sensible way to do a lot of dictionary lookups is using a
dictionary, especially considering that Perl has native support for
that.
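
With a prebuilt hash along the lines of the %items_by_code sketched
earlier in the thread (the name and the key-derivation rule are
assumptions), the per-file work in the find callback collapses to a
single lookup:

sub set_flag {
    return unless -f $_;

    (my $item_code_part = $_) =~ s/\.jpg//;
    $item_code_part = uc $item_code_part;
    $item_code_part =~ s|_|/|g;

    # One hash lookup instead of a grep over every stock key.
    $stock_items->{$_}{got_image} = 1
        for @{ $items_by_code{$item_code_part} || [] };
}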
 

Rainer Weikusat

Rainer Weikusat said:
Ben Morrow said:
Quoth Justin C <[email protected]>:
[...]
find(\&set_flag, (keys %{ $stock_groups->{text2code} }));

sub set_flag {
return unless (-f $_ );

(my $item_code_part = $_) =~ s/\.jpg//;
$item_code_part = uc($item_code_part);
$item_code_part =~ s|_|/|g;

my @matches = grep(/$item_code_part/, keys %{ $stock_items });

Careful: you want \Q there, even if you think you're sure the filenames
are all safe.
foreach my $i (@matches) {
$stock_items->{$i}{got_image} = 1;
}
}

I would probably turn this into a big pattern match. Something like
this:

use File::Find::Rule;

my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
File::Find::Rule->file->in(keys %{...});

while (my ($item, $entry) = each %$stock_items) {
$item =~ $imgs and $entry->{got_image} = 1;
}
[...]

Matching against a set of alternative patterns is O(n). In this
particular case, this means that the algorithm you suggest still
scales as badly as the originally used one

OTOH, I have now learnt why 'autothreading' is supposed to be
useful. I can imagine the following set of 'optimizations' which led
to it:

1. Do it as in the original example. The algorithm is
quadratic and doesn't scale, performance sucks.

2. Assemble a giant regexp: The algorithm is still quadratic
and doesn't scale, performance sucks.

3. Turn the giant regexp into a suitable kind of search tree:
Unfortunately, the algorithm is still quadratic and doesn't
scale, performance sucks.

4. Go back to 2, use as many cores as available in order to
try alternatives in parallel: Given that the algorithm
remains quadratic and doesn't scale, performance still sucks
but now, it at least flattens the complete system.

[5. Give up in despair and use a Hadoop-cluster. Add more
nodes whenever the fact that quadratic algorithms don't scale
and performance sucks because of this becomes too obvious].

Steps 1 - 5 also involve a successive increase in code complexity,
starting from 'more complicated than necessary' and ending with
'absolutely insanely more complicated than necessary' ...
 

Rainer Weikusat

[...]

I've had similar problems in my day-to-day work (iterating over
thousands of files and building relationships with database contents)
and found the only truly efficient approach is to avoid repetitive
lookups.

The only 'truly efficient' way to solve such a problem is 'use a
suitable lookup algorithm'. Otherwise, you are always betting on
"runtime proportional to n*n won't matter because n will always be
small".
 

Rainer Weikusat

Ben Morrow said:
Given the problem as stated, I don't believe that's practical.
$item_code_part can match the stock_item anywhere along its length, so
you would need to enter each stock_item under all of its substrings.
That obviously turns into a combinatorial nightmare very fast.

If, of course, there is some further constraint we don't know about,
like 'the item code part is the section of the stock item up to the
third underscore', then the approach you suggest is the right one.

Just because the posted code behaves in this way doesn't mean that this
behaviour is strictly necessary to solve the problem. If it were
otherwise, coding errors wouldn't exist. And in this particular case,
I would risk a bet on the assumption that the regular expression match
could be replaced with a comparison of the item_code_part and a
substring of each stock_items key whose start and length are the same
for each key and known in advance, IOW, that, given a stock_items key,
its item_code_part can be calculated without knowing the set of all
item_code_parts presently corresponding to image files.

Independently of this, another sensible idea would be to turn the
'stock item key' into an attribute of a data structure, say, an array
reference, put these data structures into an array, traverse the array
to locate the first matching item, process it once it is found and
remove it from the array afterwards (or use a programming language
designed for people who don't want to be bothered with outlandish
distinctions like that, eg, PHP).
 

J. Gleixner

We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.
[...]
The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

Since you already have data in a DB, I'd suggest looking at
associating these files with the items in the database.

Maybe store the path to the file, or possibly the image as a BLOB, in a
many-to-one relationship.
 

Rainer Weikusat

[...]
3. Turn the giant regexp into a suitable kind of search tree:
Unfortunately, the algorithm is still quadratic and doesn't
scale, performance sucks.

This was an error on my part: the total running time should be
proportional to the number of stock_items times log(number of images),
which is not 'quadratic' as that term is commonly used and understood.
But for 20k images, the running time will still be proportional to the
number of stock items times roughly 14, much more than what can be
achieved by 'throwing memory at the problem' aka 'using a hash
table'.
 

Rainer Weikusat

Ben Morrow said:
The first part of that sentence does not imply the second: 'N is usually
small' and all that.

Assuming that the original problem was stated correctly, 'N' is not
small here. Further, when 'N' is 'small',

Ben Morrow said:
When you're comparing heavily optimized C like the
regex engine with pretty much *any* logic in Perl,
'small' can end up being quite large.

the overhead of 'pretty much any logic in Perl' shouldn't
matter. Also, last time I looked, hashes were a datatype built into
Perl.
 

Justin C

Ben Morrow said:
Careful: you want \Q there, even if you think you're sure the filenames
are all safe.

Done that, thank you.

I would probably turn this into a big pattern match. Something like
this:

use File::Find::Rule;

my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
File::Find::Rule->file->in(keys %{...});

while (my ($item, $entry) = each %$stock_items) {
$item =~ $imgs and $entry->{got_image} = 1;
}

That takes a bit of understanding, I'll have a read of the docs.

If you're using 5.14 you can get rid of the ugly map block using s///r
and tr///r:

map tr!_!/!r, map s/\.jpg//r,

Still on 5.10, we follow Debian 'stable' and so won't be upgrading for a
while.

Thanks for the suggestions.

Justin.
 

Justin C

We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.
[...]
The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

Since you already have data in a DB, I'd suggest looking at
associating these files, to the items, in the database.

Maybe store the path to the file, or possibly the image as a BLOB, a
many to one relationship.

Unfortunately it's not a DB I'm authorised to write to. I can run
queries that extract data, but any changes to the data must be done
through the 3rd party supplied interface.


Justin.
 

Dr.Ruud

We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.

Are these (hard- or soft-) linked to a single file? If so, then you can
use that attribute.
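
(Illustrative only: if they are hard links, the shared inode shows up
in stat(), so grouping by device and inode would find the sets; the
directory list here is a placeholder.)

use File::Find;

# Hard links to the same image share a device and inode number.
my %image_group;                       # "dev:ino" => [ file names ]
find(sub {
    return unless -f $_;
    my ($dev, $ino) = stat _;          # reuse the stat from the -f test
    push @{ $image_group{"$dev:$ino"} }, $File::Find::name;
}, @image_dirs);                       # placeholder for the real dirs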
 

J. Gleixner

We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.
[...]
The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

Since you already have data in a DB, I'd suggest looking at
associating these files, to the items, in the database.

Maybe store the path to the file, or possibly the image as a BLOB, a
many to one relationship.

Unfortunately it's not a DB I'm authorised to write to. I can run
queries that extract data, but any changes to the data must be done
through the 3rd party supplied interface.

OK. Then put it in your own database. MySQL, Postgres, SQLite, even a
DBM file might work.
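
(A throwaway sketch of the SQLite route via DBI; the table, column
names and example values are made up.)

use DBI;

# Local SQLite file acting as the item/image mapping store.
my $dbh = DBI->connect('dbi:SQLite:dbname=images.db', '', '',
                       { RaiseError => 1 });

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS item_image (
    stock_key  TEXT PRIMARY KEY,
    image_path TEXT NOT NULL
)
SQL

my $ins = $dbh->prepare(
    'INSERT OR REPLACE INTO item_image (stock_key, image_path)
     VALUES (?, ?)');
my ($stock_key, $image_path) = ('ABC/123/XL', 'abc_123.jpg');  # example values
$ins->execute($stock_key, $image_path);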
 

Rainer Weikusat

J. Gleixner said:
On 03/28/12 10:24, Justin C wrote:
We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.
[...]
The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

Since you already have data in a DB, I'd suggest looking at
associating these files, to the items, in the database.

Maybe store the path to the file, or possibly the image as a BLOB, a
many to one relationship.

Unfortunately it's not a DB I'm authorised to write to. I can run
queries that extract data, but any changes to the data must be done
through the 3rd party supplied interface.

OK. Then put it in your own database. MySQL, Postgres, SQLite, even a
DBM file might work.

This is a 'simple' key <-> value mapping, only complicated by the fact
that keys can have multiple values associated with them. Provided
that the code for a particular stock item can be calculated, the
simple first attempt at a solution would still be to use a Perl hash
mapping codes to sets of stock item keys. This hash can be generated
once in advance and then be used for any number of 'fast' lookups. A
more elaborate design would be to use a flat-file hashed database ('DBM
file') to store the mappings, update that whenever the set of stock
items changes and use it to associate presently existing images with
stock items without recalculating the mappings. An even better idea
would be to use a persistent database mapping stock items to image
files and update this database whenever the set of image files
changes.
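
A minimal sketch of the 'DBM file' variant (SDBM_File ships with Perl;
the key-derivation rule and the use of a NUL byte to separate multiple
stock keys per code are assumptions):

use Fcntl;
use SDBM_File;

# Flat-file hashed database: item code part => "\0"-joined stock keys.
tie my %code2items, 'SDBM_File', 'code2items', O_RDWR | O_CREAT, 0644
    or die "cannot tie code2items: $!";

# Rebuild whenever the set of stock items changes.
%code2items = ();
for my $key (keys %{ $stock_items }) {
    (my $code = $key) =~ s{/[^/]+\z}{};              # assumed key format
    my @existing = split /\0/, $code2items{$code} // '';
    $code2items{$code} = join "\0", @existing, $key;
}

# Lookup side: one fetch per image file found.
my @matching_keys = split /\0/, $code2items{$item_code_part} // '';

# (SDBM has small per-entry size limits; GDBM_File or DB_File lift that.)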
 
