Index of first and last non-"\xff" in a long string

w.c.humann · Nov 12, 2007

I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

# slow
my $first = -1;
1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
my $last = $width;
1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
print STDERR "f: $first, l: $last, ";

# the match for $first2 is fast, but the one for $last2 is really slow
my $first2 = length( ($data =~ /^(\xff+)/)[0] );
my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
print STDERR "f2: $first2, l2: $last2, ";

# best solution so far. "tr" is the slowest part of this.
# Is there a way without the "tr"?
$data =~ tr|\x00-\xfe|\x00|;
my $first3 = index $data, "\x00";
my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
print STDERR "f3: $first3, l3: $last3, ";

print STDERR "\n";

Thanks,
Wolfram

Ben Morrow · Nov 12, 2007

Quoth (e-mail address removed):

I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

On my machine, the benchmark below gives

Rate subst sloop chop match reverse index C
subst 403/s -- -90% -94% -96% -97% -97% -100%
sloop 4078/s 912% -- -36% -58% -73% -73% -96%
chop 6340/s 1474% 55% -- -35% -58% -59% -94%
match 9754/s 2322% 139% 54% -- -36% -37% -91%
reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
index 15376/s 3718% 277% 143% 58% 1% -- -85%
C 104065/s 25739% 2452% 1541% 967% 583% 577% --

so index is probably the best you're going to get without using C.

Ben

#!/usr/bin/perl

use Benchmark qw/cmpthese/;

my $str = ("\xff" x 160) . ("f" x 10_000) . ("\xff" x 150);
my $len = length $str;

use Inline C => <<'EOC';

IV
unindex(SV *sv, const char *str)
{
const char *pv;
IV len, i = 0;
const char chr = str[0];

pv = SvPV(sv, len);
while (pv == chr) i++;

return i;
}

IV
unrindex(SV *sv, const char *str)
{
const char *pv;
IV len, i;
const char chr = str[0];

pv = SvPV(sv, len);
i = len - 1;
while (pv == chr) i--;

return i;
}

EOC

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
161 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
10_160 == $-[2] or die "\$- failed: " . $-[2];
},
index => sub {
local $_ = $str;
tr,\x00-\xfe,\x00,;
161 == 1 + index $_, "\x00"
or die "index failed: " . 1 + index $_, "\x00";
10_160 == 1 + rindex $_, "\x00"
or die "rindex failed: " . 1 + rindex $_, "\x00";
},
subst => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
s/\xff+$//;
10_160 == length or die "subst failed: " . length;
},
sloop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while s/\xff$//;
10_160 == length or die "sloop failed: " . length;
},
chop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while "\xff" eq chop;
10_160 == 1 + length or die "chop failed: " . (1 + length);
},
reverse => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
$_ = reverse;
/[^\xff]/g;
10_160 == (1 + $len - pos)
or die "reverse failed: " . (1 + $len - pos);
},
C => sub {
local $_ = $str;
161 == 1 + unindex $_, "\xff"
or die "unindex failed: " . (1 + unindex $_, "\xff");
10_160 == 1 + unrindex $_, "\xff"
or die "unrindex failed: " . (1 + unrindex $_, "\xff");
},
};

John W. Krahn · Nov 12, 2007

I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

# slow
my $first = -1;
1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
my $last = $width;
1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
print STDERR "f: $first, l: $last, ";

# the match for $first2 is fast, but the one for $last2 is really slow
my $first2 = length( ($data =~ /^(\xff+)/)[0] );
my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
print STDERR "f2: $first2, l2: $last2, ";

# best solution so far. "tr" is the slowest part of this.
# Is there a way without the "tr"?
$data =~ tr|\x00-\xfe|\x00|;
my $first3 = index $data, "\x00";
my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
print STDERR "f3: $first3, l3: $last3, ";

print STDERR "\n";

Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );

John

w.c.humann · Nov 12, 2007

Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );

Thanks John,

at least a hundred times faster than my attempt at pattern matching --
but still several times slower than tr/index/rindex.

Wolfram

w.c.humann · Nov 12, 2007

On my machine, the benchmark below gives

Rate subst sloop chop match reverse index C
subst 403/s -- -90% -94% -96% -97% -97% -100%
sloop 4078/s 912% -- -36% -58% -73% -73% -96%
chop 6340/s 1474% 55% -- -35% -58% -59% -94%
match 9754/s 2322% 139% 54% -- -36% -37% -91%
reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
index 15376/s 3718% 277% 143% 58% 1% -- -85%
C 104065/s 25739% 2452% 1541% 967% 583% 577% --

so index is probably the best you're going to get without using C.

Hey, some great ideas here, thanks Ben.

Glad I had already found the fastest pure-perl solution

(but 'reverse' is so close, the order might change per run...)

'Inline C' is great but less portable, and I'm mainly using this on
win32.

Wolfram

Ben Morrow · Nov 12, 2007

Quoth (e-mail address removed):

Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );

Click to expand...

D'oh! I knew my match didn't need to do so much backtracking...

at least a hundred times faster than my attempt at pattern matching --
but still several times slower than tr/index/rindex.

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;
161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
},

to my previous benchmark, I get

Rate match index innerm C
match 9609/s -- -41% -53% -91%
index 16398/s 71% -- -21% -85%
innerm 20641/s 115% 26% -- -81%
C 109225/s 1037% 566% 429% --

though seriously increasing the number of trailing "\xff"s causes both
'match' and 'innerm' to perform dramatically badly, so maybe this is an
artefact of my test string.

Ben

w.c.humann · Nov 13, 2007

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

Mine is:
This is perl, v5.8.7 built for MSWin32-x86-multi-thread

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;
161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
},

to my previous benchmark, I get

Rate match index innerm C
match 9609/s -- -41% -53% -91%
index 16398/s 71% -- -21% -85%
innerm 20641/s 115% 26% -- -81%
C 109225/s 1037% 566% 429% --

Well, with your test-string I get:

Rate subst sloop chop match index reverse
innerm
subst 1306/s -- -91% -94% -94% -94% -95%
-97%
sloop 13917/s 965% -- -33% -41% -41% -52%
-72%
chop 20894/s 1499% 50% -- -11% -11% -28%
-59%
match 23417/s 1693% 68% 12% -- -1% -19%
-54%
index 23561/s 1704% 69% 13% 1% -- -18%
-53%
reverse 28902/s 2112% 108% 38% 23% 23% --
-43%
innerm 50510/s 3767% 263% 142% 116% 114% 75%
--

so indeed 'innerm' wins, but...

though seriously increasing the number of trailing "\xff"s causes both
'match' and 'innerm' to perform dramatically badly, so maybe this is an
artefact of my test string.

typical files have blank lines at top and bottom and if I modify your
script like this:

# lenght of left, middle and right part of string
#my ($l,$m,$r) = (160, 10_000, 150);
my ($l,$m,$r) = (5000, 2, 5000);

my $str = ("\xff" x $l) . ("f" x $m) . ("\xff" x $r);
my $len = length $str;

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
$l+1 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
$l+$m == $-[2] or die "\$- failed: " . $-[2];
},
etc.

the result is:

Rate subst sloop chop match innerm
reverse index
subst 1.44/s -- -100% -100% -100% -100%
-100% -100%
sloop 462/s 32117% -- -37% -75% -82%
-86% -98%
chop 737/s 51284% 59% -- -60% -71%
-78% -96%
match 1844/s 128383% 299% 150% -- -28%
-45% -91%
innerm 2575/s 179348% 457% 249% 40% --
-23% -87%
reverse 3352/s 233498% 625% 355% 82% 30%
-- -84%
index 20452/s 1424967% 4323% 2673% 1009% 694%
510% --

Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

An optimization not considered so far: Once I've found a left and
right bound in a line, I only need to check from the edges up these
bounds in all following lines, because my bounding-box can only grow
(and never shrink) while checking further lines. As a faked test I've
used this:

index2 => sub {
my $left = substr($str,0,$l+500);
my $right = substr($str,$l+$m-500,$r+500);
$left =~ tr,\x00-\xfe,\x00,;
$right =~ tr,\x00-\xfe,\x00,;
$l+1 == 1 + index $left, "\x00"
or die "index failed: " . (1 + index $left, "\x00");
500 == 1 + rindex $right, "\x00"
or die "rindex failed: " . (1 + rindex $right, "\x00");
},

The potential savings are big but of course highly dependent on the
actual image:
Rate index reverse innerm index2
index 23924/s -- -17% -53% -82%
reverse 28749/s 20% -- -43% -79%
innerm 50574/s 111% 76% -- -63%
index2 136616/s 471% 375% 170% --

Wolfram

John W. Krahn · Nov 13, 2007

Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;

Click to expand...

[ SNIP ]

Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

Not strange at all. The pattern has to match at least two [^\xff]
characters.

John

How can I calculate the last payment of the year to be the sum of all previous payments for that year and subtracting it from Research Costs value?	7	Aug 22, 2023
'Needless flexibilities' and structured records [very long]	10	Mar 15, 2013
need help with a cart I inherited, need to increase number of total characters allowed	3	Oct 22, 2007
Identify if a scalar is int, double or text	16	May 11, 2007
Remove first letter of each string in array	11	May 27, 2004
Shallow copy and long long double on gcc 3.4.5	6	Sep 18, 2009
[SUMMARY] Index and Query (#54)	0	Nov 17, 2005
my first screen scraper	0	Dec 2, 2007

Index of first and last non-"\xff" in a long string

w.c.humann

Ben Morrow

John W. Krahn

w.c.humann

w.c.humann

Ben Morrow

w.c.humann

John W. Krahn

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads