Index of first and last non-"\xff" in a long string

W

w.c.humann

I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

# slow
my $first = -1;
1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
my $last = $width;
1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
print STDERR "f: $first, l: $last, ";

# the match for $first2 is fast, but the one for $last2 is really slow
my $first2 = length( ($data =~ /^(\xff+)/)[0] );
my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
print STDERR "f2: $first2, l2: $last2, ";

# best solution so far. "tr" is the slowest part of this.
# Is there a way without the "tr"?
$data =~ tr|\x00-\xfe|\x00|;
my $first3 = index $data, "\x00";
my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
print STDERR "f3: $first3, l3: $last3, ";

print STDERR "\n";

Thanks,
Wolfram
 
B

Ben Morrow

Quoth (e-mail address removed):
I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

On my machine, the benchmark below gives

Rate subst sloop chop match reverse index C
subst 403/s -- -90% -94% -96% -97% -97% -100%
sloop 4078/s 912% -- -36% -58% -73% -73% -96%
chop 6340/s 1474% 55% -- -35% -58% -59% -94%
match 9754/s 2322% 139% 54% -- -36% -37% -91%
reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
index 15376/s 3718% 277% 143% 58% 1% -- -85%
C 104065/s 25739% 2452% 1541% 967% 583% 577% --

so index is probably the best you're going to get without using C.

Ben

#!/usr/bin/perl

use Benchmark qw/cmpthese/;

my $str = ("\xff" x 160) . ("f" x 10_000) . ("\xff" x 150);
my $len = length $str;

use Inline C => <<'EOC';

IV
unindex(SV *sv, const char *str)
{
const char *pv;
IV len, i = 0;
const char chr = str[0];

pv = SvPV(sv, len);
while (pv == chr) i++;

return i;
}

IV
unrindex(SV *sv, const char *str)
{
const char *pv;
IV len, i;
const char chr = str[0];

pv = SvPV(sv, len);
i = len - 1;
while (pv == chr) i--;

return i;
}

EOC

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
161 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
10_160 == $-[2] or die "\$- failed: " . $-[2];
},
index => sub {
local $_ = $str;
tr,\x00-\xfe,\x00,;
161 == 1 + index $_, "\x00"
or die "index failed: " . 1 + index $_, "\x00";
10_160 == 1 + rindex $_, "\x00"
or die "rindex failed: " . 1 + rindex $_, "\x00";
},
subst => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
s/\xff+$//;
10_160 == length or die "subst failed: " . length;
},
sloop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while s/\xff$//;
10_160 == length or die "sloop failed: " . length;
},
chop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while "\xff" eq chop;
10_160 == 1 + length or die "chop failed: " . (1 + length);
},
reverse => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
$_ = reverse;
/[^\xff]/g;
10_160 == (1 + $len - pos)
or die "reverse failed: " . (1 + $len - pos);
},
C => sub {
local $_ = $str;
161 == 1 + unindex $_, "\xff"
or die "unindex failed: " . (1 + unindex $_, "\xff");
10_160 == 1 + unrindex $_, "\xff"
or die "unrindex failed: " . (1 + unrindex $_, "\xff");
},
};
 
J

John W. Krahn

I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

# slow
my $first = -1;
1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
my $last = $width;
1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
print STDERR "f: $first, l: $last, ";

# the match for $first2 is fast, but the one for $last2 is really slow
my $first2 = length( ($data =~ /^(\xff+)/)[0] );
my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
print STDERR "f2: $first2, l2: $last2, ";

# best solution so far. "tr" is the slowest part of this.
# Is there a way without the "tr"?
$data =~ tr|\x00-\xfe|\x00|;
my $first3 = index $data, "\x00";
my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
print STDERR "f3: $first3, l3: $last3, ";

print STDERR "\n";

Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );



John
 
W

w.c.humann

Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );

Thanks John,

at least a hundred times faster than my attempt at pattern matching --
but still several times slower than tr/index/rindex.

Wolfram
 
W

w.c.humann

On my machine, the benchmark below gives

Rate subst sloop chop match reverse index C
subst 403/s -- -90% -94% -96% -97% -97% -100%
sloop 4078/s 912% -- -36% -58% -73% -73% -96%
chop 6340/s 1474% 55% -- -35% -58% -59% -94%
match 9754/s 2322% 139% 54% -- -36% -37% -91%
reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
index 15376/s 3718% 277% 143% 58% 1% -- -85%
C 104065/s 25739% 2452% 1541% 967% 583% 577% --

so index is probably the best you're going to get without using C.

Hey, some great ideas here, thanks Ben.

Glad I had already found the fastest pure-perl solution :)
(but 'reverse' is so close, the order might change per run...)

'Inline C' is great but less portable, and I'm mainly using this on
win32.

Wolfram
 
B

Ben Morrow

Quoth (e-mail address removed):
Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );

D'oh! I knew my match didn't need to do so much backtracking...
at least a hundred times faster than my attempt at pattern matching --
but still several times slower than tr/index/rindex.

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;
161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
},

to my previous benchmark, I get

Rate match index innerm C
match 9609/s -- -41% -53% -91%
index 16398/s 71% -- -21% -85%
innerm 20641/s 115% 26% -- -81%
C 109225/s 1037% 566% 429% --

though seriously increasing the number of trailing "\xff"s causes both
'match' and 'innerm' to perform dramatically badly, so maybe this is an
artefact of my test string.

Ben
 
W

w.c.humann

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

Mine is:
This is perl, v5.8.7 built for MSWin32-x86-multi-thread
and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;
161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
},

to my previous benchmark, I get

Rate match index innerm C
match 9609/s -- -41% -53% -91%
index 16398/s 71% -- -21% -85%
innerm 20641/s 115% 26% -- -81%
C 109225/s 1037% 566% 429% --

Well, with your test-string I get:

Rate subst sloop chop match index reverse
innerm
subst 1306/s -- -91% -94% -94% -94% -95%
-97%
sloop 13917/s 965% -- -33% -41% -41% -52%
-72%
chop 20894/s 1499% 50% -- -11% -11% -28%
-59%
match 23417/s 1693% 68% 12% -- -1% -19%
-54%
index 23561/s 1704% 69% 13% 1% -- -18%
-53%
reverse 28902/s 2112% 108% 38% 23% 23% --
-43%
innerm 50510/s 3767% 263% 142% 116% 114% 75%
--

so indeed 'innerm' wins, but...
though seriously increasing the number of trailing "\xff"s causes both
'match' and 'innerm' to perform dramatically badly, so maybe this is an
artefact of my test string.

typical files have blank lines at top and bottom and if I modify your
script like this:

# lenght of left, middle and right part of string
#my ($l,$m,$r) = (160, 10_000, 150);
my ($l,$m,$r) = (5000, 2, 5000);

my $str = ("\xff" x $l) . ("f" x $m) . ("\xff" x $r);
my $len = length $str;

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
$l+1 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
$l+$m == $-[2] or die "\$- failed: " . $-[2];
},
etc.

the result is:

Rate subst sloop chop match innerm
reverse index
subst 1.44/s -- -100% -100% -100% -100%
-100% -100%
sloop 462/s 32117% -- -37% -75% -82%
-86% -98%
chop 737/s 51284% 59% -- -60% -71%
-78% -96%
match 1844/s 128383% 299% 150% -- -28%
-45% -91%
innerm 2575/s 179348% 457% 249% 40% --
-23% -87%
reverse 3352/s 233498% 625% 355% 82% 30%
-- -84%
index 20452/s 1424967% 4323% 2673% 1009% 694%
510% --

Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

An optimization not considered so far: Once I've found a left and
right bound in a line, I only need to check from the edges up these
bounds in all following lines, because my bounding-box can only grow
(and never shrink) while checking further lines. As a faked test I've
used this:

index2 => sub {
my $left = substr($str,0,$l+500);
my $right = substr($str,$l+$m-500,$r+500);
$left =~ tr,\x00-\xfe,\x00,;
$right =~ tr,\x00-\xfe,\x00,;
$l+1 == 1 + index $left, "\x00"
or die "index failed: " . (1 + index $left, "\x00");
500 == 1 + rindex $right, "\x00"
or die "rindex failed: " . (1 + rindex $right, "\x00");
},

The potential savings are big but of course highly dependent on the
actual image:
Rate index reverse innerm index2
index 23924/s -- -17% -53% -82%
reverse 28749/s 20% -- -43% -79%
innerm 50574/s 111% 76% -- -63%
index2 136616/s 471% 375% 170% --


Wolfram
 
J

John W. Krahn

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;

[ SNIP ]
Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

Not strange at all. The pattern has to match at least two [^\xff]
characters.


John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,817
Latest member
DicWeils

Latest Threads

Top