How can I use a REGEX to process WORDS

spydox · Jan 21, 2009

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

!
START APPLE
..
..
NUMBER 5
COLOR RED
..
..
END APPLE
..
!
START ORANGE
..
..
NUMBER 10
COLOR GREEN
..
END ORANGE
..
..
!
START APPLE
..
..
NUMBER 10
COLOR GREEN
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE

Steve Roscio · Jan 21, 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Is this along the lines of what you're looking for?

$a = ... #your big blob of fruit text
$fruit = "apple";
$color = "red";
@clauses = $a =~ m{start \s+ $fruit
((?: .*? ) \b color \s+ $color (?: .*? )) \b
end \s $fruit
}ixg;

In the above @clauses will contain all inner clauses that match the
fruit and color given. I know this doesn't address the substitution
part of your problem, but lets take it a step at a time... (I'm not
sure I understand your problem fully).

- - Steve

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

!
START APPLE
.
.
NUMBER 5
COLOR RED
.
.
END APPLE
.
!
START ORANGE
.
.
NUMBER 10
COLOR GREEN
.
END ORANGE
.
.
!
START APPLE
.
.
NUMBER 10
COLOR GREEN
.
.

END APPLE
!
START APPLE
.
.
NUMBER 2
COLOR RED
.
.
END APPLE

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkl3mf0ACgkQb3tX8F/f+Wy3YQCffjXyYzfO/+gUAJ5CPao7RlVZ
sLgAn1ef/R6D+fQgJyw7tYU8gdM6m1WQ
=3VLf
-----END PGP SIGNATURE-----

A. Sinan Unur · Jan 21, 2009

(e-mail address removed) wrote in @m12g2000vbp.googlegroups.com:

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

I am not sure anything is really wrong with the approaches listed above.
I would be inclined to parse the data into proper Perl data structures
(in this case an array of hashrefs seems reasonable). However, it seems
to me like you are not familiar with the range operator .. so I came up
with the following example using your data:

#!/usr/bin/perl

use strict;
use warnings;

my ($number, $color, $buffer);

while ( <DATA> ) {
print and next unless my $seq = /^START APPLE/ .. /^END APPLE/;

$buffer .= $_;
$number = $1 if /^NUMBER (\d+)/;
$color = $1 if /^COLOR (\w+)/;

if ( $seq =~ /E0$/ ) {

if ( $color eq 'RED') {
my $replacement = $number + 1;
$buffer =~ s/^NUMBER $number/NUMBER $replacement/m;
}

print $buffer;
undef for $number, $color, $buffer;
}
}

__DATA__
!
START APPLE
..
..
NUMBER 5
COLOR RED
..
..
END APPLE
..
!
START ORANGE
..
..
NUMBER 10
COLOR GREEN
..
END ORANGE
..
..
!
START APPLE
..
..
NUMBER 10
COLOR GREEN
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

spydox · Jan 22, 2009

Is this along the lines of what you're looking for?

$a = ... #your big blob of fruit text
$fruit = "apple";
$color = "red";
@clauses = $a =~ m{start \s+ $fruit
((?: .*? ) \b color \s+ $color (?: .*? )) \b
end \s $fruit
}ixg;

In the above @clauses will contain all inner clauses that match the
fruit and color given. I know this doesn't address the substitution
part of your problem, but lets take it a step at a time... (I'm not
sure I understand your problem fully).

- - Steve

Wow - yes I think you are barking up the RIGHT tree! That does look
appealing. The *inner clause* concept is really the concept I was
after. The subs I can manage I think by extending your regex a bit..

Thank-You both for taking the time to reply, both answers got me
thinking about a new approach.

Cheers,
S

spydox · Jan 22, 2009

(e-mail address removed) wrote in @m12g2000vbp.googlegroups.com:

#!/usr/bin/perl

use strict;
use warnings;

my ($number, $color, $buffer);

while ( <DATA> ) {
print and next unless my $seq = /^START APPLE/ .. /^END APPLE/;

$buffer .= $_;
$number = $1 if /^NUMBER (\d+)/;
$color = $1 if /^COLOR (\w+)/;

if ( $seq =~ /E0$/ ) {

if ( $color eq 'RED') {
my $replacement = $number + 1;
$buffer =~ s/^NUMBER $number/NUMBER $replacement/m;
}

print $buffer;
undef for $number, $color, $buffer;
}

}

Sinan

I'm not sure this will work since the COLOR is undef until its too
late? It comes later in the clause?

I really like your construct: undef for $number, $color, $buffer;

is that the same (effecively) as

( $number, $color, $buffer ) = (undef, undef, undef );

?

Also is, as you stated it,

$number, $color, $buffer

a list? If so why doesn't it require parens?

S

A. Sinan Unur · Jan 22, 2009

(e-mail address removed) wrote in @p36g2000prp.googlegroups.com:

....

I'm not sure this will work since the COLOR is undef until its too
late? It comes later in the clause?

It does not matter. The program does not enter the body of the

if ( $seq =~ /E0$/ )

statement until after it has seen all relevant information. Run the
program (in a debugger if necessary) to see how it works.

I really like your construct: undef for $number, $color, $buffer;

is that the same (effecively) as

( $number, $color, $buffer ) = (undef, undef, undef );

Strictly speaking, it is the same as:

undef $number;
undef $color;
undef $buffer;

Also is, as you stated it,

$number, $color, $buffer

a list? If so why doesn't it require parens?

It is standard for loop syntax. No different than:

say for 1, 2, 3;

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Tad J McClellan · Jan 22, 2009

Also is, as you stated it,

$number, $color, $buffer

a list?

Yes.

If so why doesn't it require parens?

Because that's what the "Statement Modifiers" section in

perldoc perlsyn

says

(also because parens do not make a list!)

Uri Guttman · Jan 22, 2009

s> I really like your construct: undef for $number, $color, $buffer;

you shouldn't like that construct. it is slower and clunkier than
needed. my rule is to avoid explicit use of the undef function as much
as possible. this is for style reasons and undef is easily avoidable

s> is that the same (effecively) as

s> ( $number, $color, $buffer ) = (undef, undef, undef );

why the extra redundant undef's?

( $number, $color, $buffer ) = () ;

that is faster than the for loop and shorter than the explicit undef
list.

but i suspect there isn't a need for undefing like that. whenever i see
a bunch of that sort of code i smell a poor design of the loop or
whatever. designing things so you don't need that is not difficult and
you can use scope, sub return, etc. to handle things without explicit
calls to undef.

uri

sln · Jan 22, 2009

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

[snip data]

I guess this is one way to do it if you don't mind zero-width assertions.

sln

-------------------------------
use strict;
use warnings;

my $txt = join '', <DATA>;

my $delim1 = "START APPLE";
my $delim2 = "END APPLE";
my $color = "COLOR RED";
my $number = "NUMBER";
my $cntr = 0;

sub getChange
{
return ($_[0] =~ /\s$color\s/i and $_[0] =~ s/(\s$number\s)(\d+)(\s)/$1.($2+1).$3/ie);
}

$txt =~ s/((?:\s|^)$delim1(?=\s))((?

?!\s$delim1\s).)*?\s)($delim2(?:\s|$))/
my $chunk = "$1$2$3";
if (getChange($chunk)) { ++$cntr; print $chunk."\n"; }
$chunk;
/isge;

if ($cntr > 0)
{
print "\nChanged $cntr block(s)\n";
} else {
print "\nDid not change any blocks, count = $cntr\n";
}
print "\n----------------------\nTXT:\n'$txt'\n";

__DATA__

!
START APPLE
..
..
NUMBER 5
COLOR RED
..
..
END APPLE
..
!
START ORANGE
..
..
NUMBER 10
COLOR GREEN
..
END ORANGE
..
..
!
START APPLE
..
..
NUMBER 10
COLOR GREEN
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE

A. Sinan Unur · Jan 22, 2009

s> I really like your construct: undef for $number, $color, $buffer;

you shouldn't like that construct. it is slower and clunkier than
....

but i suspect there isn't a need for undefing like that.

No there isn't. I was being careless. Thanks for the correction.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

sln · Jan 22, 2009

(e-mail address removed) wrote in @m12g2000vbp.googlegroups.com:

....
[snip OP statement]

I am not sure anything is really wrong with the approaches listed above.
I would be inclined to parse the data into proper Perl data structures
(in this case an array of hashrefs seems reasonable). However, it seems
to me like you are not familiar with the range operator .. so I came up
with the following example using your data:

#!/usr/bin/perl

use strict;
use warnings;

my ($number, $color, $buffer);

while ( <DATA> ) {
print and next unless my $seq = /^START APPLE/ .. /^END APPLE/;

$buffer .= $_;
$number = $1 if /^NUMBER (\d+)/;
$color = $1 if /^COLOR (\w+)/;

if ( $seq =~ /E0$/ ) {

if ( $color eq 'RED') {

^^
if ( $color && $number && $color eq 'RED') {

my $replacement = $number + 1;
$buffer =~ s/^NUMBER $number/NUMBER $replacement/m;
}

print $buffer;
undef for $number, $color, $buffer;

^^^^^ funny this doesen't undefine $_ in my Perl

# $_ = undef for $number, $color, $buffer;
# or
($number, $color, $buffer) = ();

# Doesen't matter if they are undefined, but they should be set to a cleared state
# here in some way, $buffer for sure. Un-matched NUMBER or COLOR within a sequence block
# will either kick out with an error or will use matches from a previous block.
# I guess this isn't nit-picking, just error prevention.

}
}

sln

# test data:

__DATA__
!
START APPLE
..
..
NUMBER 5a
COLOR RED
..
..
END APPLE
..
!
START APPLE
..
..
NUMBER 10
COLOR
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE

Eric Pozharski · Jan 22, 2009

s> I really like your construct: undef for $number, $color, $buffer;

you shouldn't like that construct. it is slower and clunkier than
needed. my rule is to avoid explicit use of the undef function as much
as possible. this is for style reasons and undef is easily avoidable

s> is that the same (effecively) as

s> ( $number, $color, $buffer ) = (undef, undef, undef );

why the extra redundant undef's?

( $number, $color, $buffer ) = () ;

that is faster than the for loop and shorter than the explicit undef
list.

perl -wle '
use Benchmark qw|cmpthese timethese|;
my($x, $y, $z);
my $t = timethese -10, {
for => sub { undef for $x, $y, $z },
list => sub { ($x, $y, $z) = (undef, undef, undef) },
seq => sub { undef $x; undef $y; undef $z },
empty => sub { ($x, $y, $z) = () }, };
cmpthese $t;
'
Benchmark:
running
empty, for, list, seq
for at least 10 CPU seconds
....

empty: 10 wallclock secs (10.01 usr + 0.01 sys = 10.02 CPU) @
1152581.04/s (n=11548862)

for: 13 wallclock secs (10.49 usr + 0.05 sys = 10.54 CPU) @
291213.00/s (n=3069385)

list: 13 wallclock secs (10.71 usr + 0.02 sys = 10.73 CPU) @
959226.75/s (n=10292503)

seq: 12 wallclock secs ( 9.95 usr + 0.05 sys = 10.00 CPU) @
1751871.70/s (n=17518717)

Rate for list empty seq
for 291213/s -- -70% -75% -83%
list 959227/s 229% -- -17% -45%
empty 1152581/s 296% 20% -- -34%
seq 1751872/s 502% 83% 52% --

And look at that 0.05 sys for I<seq> -- it does something. Just
rechecked -- results aren't that stable

Rate for list empty seq
for 283340/s -- -71% -76% -79%
list 963256/s 240% -- -18% -30%
empty 1167856/s 312% 21% -- -15%
seq 1371866/s 384% 42% 17% --

but i suspect there isn't a need for undefing like that. whenever i see
a bunch of that sort of code i smell a poor design of the loop or
whatever. designing things so you don't need that is not difficult and
you can use scope, sub return, etc. to handle things without explicit
calls to undef.

Though I should admit, that I would stay with Uri's idea (whatever name
of that idea is).

Uri Guttman · Jan 23, 2009

EP> Though I should admit, that I would stay with Uri's idea (whatever name
EP> of that idea is).

call it the "don't use undef unless you must" rule.

similar to the string eval and symrefs rules i have posted many times -
don't use them unless you know when not to use them!

uri

A. Sinan Unur · Jan 23, 2009

EP> Though I should admit, that I would stay with Uri's idea
(whatever name EP> of that idea is).

call it the "don't use undef unless you must" rule.

Would you believe that post where my main purpose was to come up with a
quick example of using the range operator to solve the OP's problem was
the first time I undef'ed a bunch of variables like that?

similar to the string eval and symrefs rules i have posted many times
- don't use them unless you know when not to use them!

Good, now my name is mentioned along with string eval and symrefs. I'll
never recover from that ;-)

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Tad J McClellan · Jan 23, 2009

Uri Guttman said:
EP> Though I should admit, that I would stay with Uri's idea (whatever name
EP> of that idea is).

call it the "don't use undef unless you must" rule.

It needs parens with it to distinguish the undef operator (avoid) from
the undef value (ok).

Call it the "undef() is unwanted" rule.

"undef() is unwarranted"?

"undef() is ungood"?

"undef() should be underused"?

Eric Pozharski · Jan 24, 2009

My point was aimed to OP (though useles, I think).

Would you believe that post where my main purpose was to come up with a
quick example of using the range operator to solve the OP's problem was
the first time I undef'ed a bunch of variables like that?

Good, now my name is mentioned along with string eval and symrefs. I'll
never recover from that ;-)

Sorry, it was by accident

How can I remove the specific error message after regex validation is done?	1	Jan 25, 2023
I cannot find a suitable guide on how to use jzy3d, can anyone help me?	1	Jan 5, 2024
How can I get my menu inside of a menu to function properly?	1	Jan 19, 2023
How can I fix my pattern coding error in c++	0	Mar 19, 2023
How to use Densenet121 in monai	0	Feb 16, 2024
Help for ActionPerformance and how to use HashMap.	2	Feb 10, 2022
GET NEIL DEGRASSES TYSON, I ripped a hole with this one...	0	Nov 10, 2022
How can I structure the final array to meet the requirements of Bootstrap Tree View for building a tree in JavaScript?	1	Mar 29, 2024

How can I use a REGEX to process WORDS

spydox

Steve Roscio

A. Sinan Unur

spydox

spydox

A. Sinan Unur

Tad J McClellan

Uri Guttman

sln

A. Sinan Unur

sln

Eric Pozharski

Uri Guttman

A. Sinan Unur

Tad J McClellan

Eric Pozharski

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads