SUBSTR() with replacement or lvalue performance issues

sln

I've read the docs on substr many times, but I'm still not
quite clear on what it costs when used as an lvalue or with the replacement parameter.

I have a possibly quite large string (could be megabytes).
I want to insert a replacement text, possibly in the middle.
I'm iterating over the string through subs, etc.

Apart from like, copy from the start of a matched position, to a
file (as opposed to another buffer), then catenating the modification
to the file, then continue on with the next match, is the substr
(lvalue or replacement) a viable option?

I have to consider performance on such large operations.

What do you think would be the performance 'hit' if modifying
the string in-place using substr as either an lvalue or replacement?

There has to be some memcpy()'s or moves involved.
If replacement based, I can adjust the pos() for the next match,
but to insert even a little change in string size, in the middle
of a very large string could be a big performance hit?

All help is appreciated!
TIA

sln
 
xhoster

sln said:
Apart from like, copy from the start of a matched position, to a
file (as opposed to another buffer), then catenating the modification
to the file, then continue on with the next match,

Are you describing just the ordinary practice of writing your output to
an output file while looping over a read of the input file? That is
often a good way to do things.

sln said:
is the substr (lvalue or replacement) a viable option?
I have to consider performance on such large operations.
What do you think would be the performance 'hit' if modifying
the string in-place using substr as either an lvalue or replacement?

On my system it takes a little under a second to use substr to splice
into the middle of a 1GB string (in a way that makes the string longer):

##baseline:
time perl -le 'my $x; $x.="x"x1000 foreach 1..1e6; \
substr $x, 4e8, 3, "yyyy" foreach 1..0 ;'
1.352u 0.904s 0:02.29 98.2% 0+0k 0+0io 0pf+0w

##20 substrs
time perl -le 'my $x; $x.="x"x1000 foreach 1..1e6; \
substr $x, 4e8, 3, "yyyy" foreach 1..20 ;'
16.585u 1.084s 0:18.31 96.4% 0+0k 0+0io 0pf+0w
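For anyone who wants to try this at a smaller scale, here is a rough sketch (not the script used for the timings above) that compares repeated 4-argument substr insertions with a single pass that collects the pieces and joins them once. The buffer size, the number of insertions, the marker text and the sub names are all assumptions made up for illustration, and the timings will vary with the machine and the perl build.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical sizes: a 1 MB buffer and 200 insertions (not the 1GB case).
my $size    = 1_000_000;
my $inserts = 200;
my $marker  = "yyyy";

# In place: each 4-arg substr call replaces 3 chars with 4, so the tail
# of the buffer has to be shifted on every call.
sub splice_in_place {
    my $buf  = "x" x $size;
    my $step = int($size / ($inserts + 1));
    # Work from the end so edits already made never shift
    # the offsets that are still to be processed.
    for my $i (reverse 1 .. $inserts) {
        substr $buf, $i * $step, 3, $marker;
    }
    return $buf;
}

# One pass: copy untouched spans and replacements into a list, join once.
sub rebuild_once {
    my $buf  = "x" x $size;
    my $step = int($size / ($inserts + 1));
    my ($last, @pieces) = (0);
    for my $i (1 .. $inserts) {
        my $off = $i * $step;
        push @pieces, substr($buf, $last, $off - $last), $marker;
        $last = $off + 3;
    }
    push @pieces, substr($buf, $last);
    return join '', @pieces;
}

my $t0 = time; my $in_place = splice_in_place();
my $t1 = time; my $rebuilt  = rebuild_once();
my $t2 = time;

printf "in place: %.4fs  rebuild: %.4fs  same result: %s\n",
    $t1 - $t0, $t2 - $t1, ($in_place eq $rebuilt ? "yes" : "no");
```

Both subs should produce the same string; only the amount of copying differs, which is what the timing line is meant to expose.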

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
sln

xhoster said:
Are you describing just the ordinary practice of writing your output to
an output file while looping over a read of the input file? That is
often a good way to do things.
[snip]

Yes. I'm looking at your benchmarks for substr(). I have nightmares about
there being over 100,000 replacements in a gigantic string, which
could easily happen. This is a block replacement; I'm not doing s///g
on a gigantic string. I'm finding a block, then acting on indirect values,
then re-inserting them in the stream. The regexp has stopped, to be re-pos()'ed
later (or not). It depends on whether a separate file is being constructed or
the replacement is acting strictly on the buffer.

I'm looking over your benches. I have a bad feeling though. Insertion
(or deletion) several hundred thousand possible times could reek havoc
maybe.

Checking... thanks!

sln
 
Michele Dondi

sln said:
Apart from like, copy from the start of a matched position, to a
file (as opposed to another buffer), then catenating the modification
to the file, then continue on with the next match, is the substr
(lvalue or replacement) a viable option?

I have to consider performance on such large operations.

ISTR that the lvaluedness of substr()'s return value, as well as the
fact that you can EVEN take references to it and modify the string
with a sort of action-at-distance, was put there specifically for
performance reasons. At some point there were problems with
substitutions having a length larger than the substituted one, IalsoIRC,
but they should be solved in recent enough perls.

See: <http://perlmonks.org/?node_id=498434>


Michele
 
smallpond

sln said:
I'm looking over your benches. I have a bad feeling though. Insertion
(or deletion) several hundred thousand possible times could reek havoc
maybe.

Code can reek, but havoc must be wreaked.

It is good practice to see if you actually have a problem before doing
optimization. "First make it right, then make it fast."
 
Mirco Wahab

sln said:
Yes. I'm looking at your benchmarks for substr(). I have nightmares about
there being over 100,000 replacements in a gigantic string, which
could easily happen.

What is the problem domain? "Megabyte strings" and
"100,000 things" might not turn out that slow on
a usual 3GHz Core2 that has 6MB of L2 cache available.

sln said:
I'm looking over your benches. I have a bad feeling though. Insertion
(or deletion) several hundred thousand possible times could reek havoc
maybe.

An lvalue substr() might be beaten by an
"Inline C" based use of memchr/memcpy, which
unfortunately would need to handle the reallocation
of the buffer manually (if possible/feasible at
all in your problem range).

Regards

M.
 
xhoster

Michele Dondi said:
ISTR that the lvaluedness of substr()'s return value, as well as the
fact that you can EVEN take references to it and modify the string
with a sort of action-at-distance, was put there specifically for
performance reasons. At some point there were problems with
substitutions having a length larger than the substituted one, IalsoIRC,
but they should be solved in recent enough perls.

See: <http://perlmonks.org/?node_id=498434>

My reading of that is there used to be problems having more than one
action-at-distance references outstanding on the same string, regardless
of the size of the replacements.

Doing replacements that don't preserve length can have a performance
impact, but I don't see that as a "problem" in the same way a bug is a
"problem"; just as one of the trade-offs that always exist and which need
to be kept in mind. I doubt this performance issue will be "fixed" anytime
soon, as it would likely require a fundamental change of the way strings
are managed (i.e. as ropes rather than as contiguous memory regions.)

Of course, the other issue is one of semantics. If I take a reference like
my $x=\substr($q,100,10), and then later do an insertion like
substr($q,10,0,"xxx"), will $x now refer to the same *characters* as it did
before (i.e. substr($q,103,10)) or to the same *positions* as it did
before? Empirically, it refers to the same positions.
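The positions-versus-characters question can be checked with a short script. This is only a sketch: the 200-character test string is made up for illustration, and it assumes a reasonably recent perl, since what a substr lvalue does after its target string changes has varied between versions.

```perl
use strict;
use warnings;

# Hypothetical 200-character test string, cycling through a..z.
my $q = join '', map { chr(ord('a') + $_ % 26) } 0 .. 199;

my $x      = \substr($q, 100, 10);   # reference to a 10-char window
my $before = $$x;

substr($q, 10, 0, "xxx");            # insert 3 chars ahead of the window

my $after = $$x;

print "before insert: $before\n";
print "after insert:  $after\n";
print "same positions:  ", ($after eq substr($q, 100, 10) ? "yes" : "no"), "\n";
print "same characters: ", ($after eq $before             ? "yes" : "no"), "\n";
```

If the window tracks positions, the characters originally seen through $x should now be found at substr($q,103,10) instead.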

Xho

sln

Michele Dondi said:
ISTR that the lvaluedness of substr()'s return value, as well as the
fact that you can EVEN take references to it and modify the string
with a sort of action-at-distance, was put there specifically for
performance reasons.

See: <http://perlmonks.org/?node_id=498434>

If this were C: place a 0 at the start of each find, save a pointer to the
beginning of the last find, and add that pointer to a list.
Create a new char[modified size] and add its pointer to the list.
Repeat until the end of the string.

Write the segments in the pointer list to a file/buffer (file).
Delete the list of pointers.

Perl can't do a reference mid-string. Jimmy-jack it maybe...

sln
 
sln

Michele Dondi said:
See: <http://perlmonks.org/?node_id=498434>

Being able to get segment references while not altering the
string works pretty well. Altering the string through the refs is
possible, but I wouldn't trust it, and the string would still shrink/expand.

In the simplest example usage, something like below would seem to solve
the performance issues. Thanks for the link!


sln
---------------------------------

use strict;
use warnings;

my $bigstring = \"some big big big scalar string";
my $lastpos = 0;
my @segrefs = ();

while ($$bigstring =~ /big/g)
{
    my ($offset, $curpos) = ($-[0], pos($$bigstring));

    # modify a local copy of the matched part of the big string;
    # a fresh lexical per match keeps each pushed reference distinct
    my $modstring = substr $$bigstring, $offset, ($curpos - $offset);
    $modstring .= "-huge";

    # cache the interval (read only) and modstring references
    push @segrefs, \substr $$bigstring, $lastpos, ($offset - $lastpos);
    push @segrefs, \$modstring;

    $lastpos = $curpos;
}

# the unmatched tail (the whole string if nothing matched)
push @segrefs, \substr $$bigstring, $lastpos;

# print the new string (to a file maybe)
print $$_ for @segrefs;
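If the output is going to a file anyway, the list of references can be skipped and each span written as soon as it is known. The following is a minimal sketch of that variant, not a tested replacement for the code above; the sub name stream_replace and the in-memory filehandle are assumptions chosen so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $$textref to $out, appending "-huge" to each match.
sub stream_replace {
    my ($textref, $out) = @_;
    my $lastpos = 0;
    while ($$textref =~ /big/g) {
        my ($start, $end) = ($-[0], $+[0]);
        # untouched span before the match
        print {$out} substr($$textref, $lastpos, $start - $lastpos);
        # modified copy of the match
        print {$out} substr($$textref, $start, $end - $start) . "-huge";
        $lastpos = $end;
    }
    # tail after the last match (the whole string if nothing matched)
    print {$out} substr($$textref, $lastpos);
    return;
}

my $text   = "some big big big scalar string";
my $result = '';
open my $out, '>', \$result or die "cannot open in-memory handle: $!";
stream_replace(\$text, $out);
close $out;
print "$result\n";
```

The big string is never modified, so nothing has to be shifted in memory; a real file handle can be passed in place of the in-memory one.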
 
Ilya Zakharevich

[A complimentary Cc of this posting was NOT [per weedlist] sent to
Michele Dondi]

Michele Dondi said:
At some point there were problems with
substitutions having a length larger than the substituted one, IalsoIRC,
but they should be solved in recent enough perls.

See: <http://perlmonks.org/?node_id=498434>

Simple experiments show that it is still buggy with 5.8.8: code below returns

the quick brown fox jumps over the laxy dog
the quick brown fox jumps over the lazy dog

Hope this helps,
Ilya

#!/usr/bin/perl -w
use strict;

my $bigScalar = 'the quick brown fox jumps over the laxy dog';

sub change_nth ($$@) {
    my($n, $subst) = (shift, shift);
    $_[$n] = $subst;
    return;        # Just in case: avoid $_[7] being returned
}

change_nth 7, 'lazy', map{ substr $bigScalar, $_->[0], $_->[1] }
    [0,3], [4,5], [10,5], [16,3], [20,5], [26,4], [31,3], [35,4], [40,3];
print "$bigScalar\n";

change_nth 1, 'lazy', substr($bigScalar, 31, 3), substr($bigScalar, 35, 4),
    substr($bigScalar, 40, 3);
print "$bigScalar\n";
__END__
 
Michele Dondi

Ilya Zakharevich said:
Simple experiments show that it is still buggy with 5.8.8: code below returns

the quick brown fox jumps over the laxy dog
the quick brown fox jumps over the lazy dog

Same with 5.10 here.


Michele
 
xhoster

Ilya Zakharevich said:
Simple experiments show that it is still buggy with 5.8.8: code below
returns
....

change_nth 7, 'lazy', map{ substr $bigScalar, $_->[0], $_->[1] }
    [0,3], [4,5], [10,5], [16,3], [20,5], [26,4], [31,3], [35,4], [40,3];
print "$bigScalar\n";

I don't see this as a bug with substr. "Map" doesn't alias the values
it returns, it copies them. So the magic is no longer there, just as it
isn't present in $x if you do:
my $x=substr $bigScalar, 1, 4;

So if it's a bug, it is a bug with map, not substr.
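The difference between copying a substr result and aliasing it can be seen without map at all. A minimal sketch, with a made-up test string; the foreach form is the aliasing idiom shown in the perlfunc entry for substr.

```perl
use strict;
use warnings;

my $s = 'the quick brown fox';

# Plain assignment copies the characters; $s is untouched.
my $copy = substr $s, 4, 5;
$copy = 'QUICK';
my $after_copy = $s;

# foreach aliases $_ to the substr lvalue, so assigning to $_ writes through.
for (substr $s, 4, 5) {
    $_ = 'QUICK';
}
my $after_alias = $s;

print "$after_copy\n$after_alias\n";
```

Only the second form changes $s, which is the "magic" that is lost once a copy has been made.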

Xho

sln

Ilya Zakharevich said:
change_nth 7, 'lazy', map{ substr $bigScalar, $_->[0], $_->[1] }
                           ^
                           \
[0,3], [4,5], [10,5], [16,3], [20,5], [26,4], [31,3], [35,4], [40,3];
print "$bigScalar\n";

xhoster said:
I don't see this as a bug with substr. "Map" doesn't alias the values
it returns, it copies them.
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to

xhoster said:
I don't see this as a bug with substr.

It is a bug, and it involves substr. ;-)

xhoster said:
"Map" doesn't alias the values it returns, it copies them.

What for? Why should this not work:

perl -wle "@a=(10..15); sub z2{$_[2]=0}; z2 map $_, @a; print for @a"

Thanks,
Ilya
 
xhoster

Ilya Zakharevich said:
change_nth 7, 'lazy', map{ substr $bigScalar, $_->[0], $_->[1] }
                           ^
                           \

Nope.

It will work provided you make the obvious change to change_nth to do the
dereference.
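A sketch of what "take a reference in map and dereference in the sub" might look like. The name change_nth_ref is hypothetical and the prototype is dropped for simplicity; this assumes a recent perl and makes no claim about how 5.8.x behaves.

```perl
use strict;
use warnings;

my $s = 'the quick brown fox jumps over the laxy dog';

# Hypothetical variant of change_nth that takes references and dereferences.
sub change_nth_ref {
    my ($n, $subst, @refs) = @_;
    ${ $refs[$n] } = $subst;
    return;
}

change_nth_ref(7, 'lazy',
    map { \substr($s, $_->[0], $_->[1]) }
        [0,3], [4,5], [10,5], [16,3], [20,5], [26,4], [31,3], [35,4], [40,3]);

print "$s\n";
```

Here map copies the references rather than the substrings, and a copied reference still points at the same substr lvalue.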

Xho

sln

Michele Dondi said:
At some point there were problems with
substitutions having a length larger than the substituted one, IalsoIRC,
but they should be solved in recent enough perls.

See: <http://perlmonks.org/?node_id=498434>

5.8.6 has problems:

use strict;
use warnings;

my $cnt;

printf "\nPerl Version: %vd\n", $^V;
my @fldvals = split( ' ', "one two three four five");

## Initialize references on dummy record
## -------------------------------------
print "\nInit \@lvrefs to 5 char fields x 7 fields/record\n",'-'x30,"\n";

my $bigscalar = "+++++-----+++++-----+++++-----+++++";
my @lvrefs = map{ \substr $bigscalar, $_->[0], $_->[1] }
[0,5], [5,5], [10,5], [15,5], [20,5], [25,5], [30,5];
print "\nrecord: <".$bigscalar.">\n\n";
$cnt = 1;
print " field ".$cnt++." \t<".$$_.">\n" for @lvrefs;

## Change record, print $$lvrefs
## -------------------------------------
print "\nChange record, print \$\$lvrefs\n",'-'x30,"\n";

# Set bigscalar to 6 char width fields, use @lvrefs to print
$bigscalar = sprintf ("%-6s"x5, @fldvals);
print "\n(\%-6s) record: <".$bigscalar.">\n\n";
$cnt = 1;
print " field ".$cnt++." \t<".$$_.">\n" for @lvrefs;

# Set bigscalar to 5 char width fields, use @lvrefs to print
$bigscalar = sprintf ("%-5s"x5, @fldvals);
print "\n(\%-5s) record: <".$bigscalar.">\n\n";
$cnt = 1;
print " field ".$cnt++." \t<".$$_.">\n" for @lvrefs;

## Assign to $$lvrefs, print record
## -------------------------------------
print "\nAssign to \$\$lvrefs, print record\n",'-'x30,"\n";

# Assign $$lvref to 10 char values
print "\n(\%-10s) \$\$lvrefs\n\n";
$cnt = 0;
for (@fldvals)
{ ${$lvrefs[$cnt++]} = sprintf "%-10s", $_; }
$cnt = 1;
print " field ".$cnt++." \t<".$$_.">\n" for @lvrefs;
print "\nrecord: <".$bigscalar.">\n";

# Assign $$lvref to 5 char values
print "\n(\%-5s) \$\$lvrefs\n\n";
$cnt = 0;
for (@fldvals)
{ ${$lvrefs[$cnt++]} = sprintf "%-5s", $_; }
$cnt = 1;
print " field ".$cnt++." \t<".$$_.">\n" for @lvrefs;
print "\nrecord: <".$bigscalar.">\n";

# Assign $$lvref to 2 char values
print "\n(\%-2d, 1..7) \$\$lvrefs\n\n";
$cnt = 0;
for (1..7)
{ ${$lvrefs[$cnt++]} = sprintf "%-2d", $_; }
$cnt = 1;
print " field ".$cnt++." \t<".$$_.">\n" for @lvrefs;
print "\nrecord: <".$bigscalar.">\n";

__END__


Perl Version: 5.8.6

Init @lvrefs to 5 char fields x 7 fields/record
------------------------------

record: <+++++-----+++++-----+++++-----+++++>

field 1 <+++++>
field 2 <----->
field 3 <+++++>
field 4 <----->
field 5 <+++++>
field 6 <----->
field 7 <+++++>

Change record, print $$lvrefs
------------------------------

(%-6s) record: <one two three four five >

field 1 <one >
field 2 < two >
field 3 < thr>
field 4 <ee fo>
field 5 <ur f>
field 6 <ive >
field 7 <>

(%-5s) record: <one two threefour five >

field 1 <one >
field 2 <two >
field 3 <three>
field 4 <four >
field 5 <five >
field 6 <>
field 7 <>

Assign to $$lvrefs, print record
------------------------------

(%-10s) $$lvrefs

field 1 <one >
field 2 <two >
field 3 <three>
field 4 <four >
field 5 <five >
field 6 < >
field 7 <two >

record: <one two threefour five two threefour five >

(%-5s) $$lvrefs

field 1 <one >
field 2 <two >
field 3 <three>
field 4 <four >
field 5 <five >
field 6 < >
field 7 <two >

record: <one two threefour five two threefour five >

(%-2d, 1..7) $$lvrefs

field 1 <1 two>
field 2 <2 eef>
field 3 <3 ive>
field 4 <4 tw>
field 5 <5 ree>
field 6 <6 fiv>
field 7 <7 >

record: <1 two2 eef3 ive4 tw5 ree6 fiv7 >
 
xhoster

Ilya Zakharevich said:
What for? Why should this not work:

perl -wle "@a=(10..15); sub z2{$_[2]=0}; z2 map $_, @a; print for @a"

It does work. It just doesn't do what you want :)

I think the current behavior is reasonable, although it would be nice to
have the other as an option (maybe Array::Splice could have a map_alias,
even though that is getting a bit far afield of its namespace).

The current behavior is consistent with other things that conceptually
*could* keep aliasing, but don't. Like push. Or subroutines.

perl -le ' my @x=1..10; sub foo { $_[0]}; foreach (@x) {
$_="x" foreach foo($_)}; print "@x"'
1 2 3 4 5 6 7 8 9 10

Well, subroutines that have not been declared :lvalue, anyway.

On the other hand, reverse does preserve aliasing.

A do block also preserves aliasing, but for some odd reason I'm not
allowed to assign to one directly.
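The subroutine case can be spelled out as a script. A small sketch, with made-up sub names (plain, through), contrasting an ordinary sub with one declared :lvalue; the :lvalue half uses direct assignment to the call.

```perl
use strict;
use warnings;

my @x = 1 .. 5;

# An ordinary sub returns a copy, so assigning to the result leaves @x alone.
sub plain { $_[0] }
for my $elem (@x) {
    $_ = "x" foreach plain($elem);
}
my @after_plain = @x;

# An :lvalue sub returns the aliased element itself, so assignment writes through.
sub through :lvalue { $_[0] }
through($x[0]) = "x";
my @after_lvalue = @x;

print "@after_plain\n@after_lvalue\n";
```

Only the :lvalue call changes the array, and only in its first element.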

Xho

xhoster

sln said:
5.8.6 has problems:

I think it is your expectations that have problems. But you haven't
described what your expectations are, so it is hard to tell.

The lvals remember the offset and length. If the main string gets
rearranged, the lval doesn't try to change the offset and length in an
attempt to trace that rearrangement. Can you imagine the morass if they
did? Instead, it keeps the "window" the same and lets characters move
around beneath it.

Also, if you assign to the lval with a string of a different length,
the lval still remembers its length to be the originally created one, not
the length of the string just assigned to it. So you can assign and
then immediately read and get a different thing than what you just
assigned. But I don't know that the alternative would be any less odd.

Xho

Ilya Zakharevich

[A complimentary Cc of this posting was sent to

xhoster said:
It will work provided you make the obvious change to change_nth to do the
dereference.

No. It would not work this way, and it would not demonstrate the bug either...

[There are many ways to work-around the bug, of course. For
example, use C instead of Perl... ;-/]

Yours,
Ilya
 
