How can I use a REGEX to process WORDS

S

spydox

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

!
START APPLE
..
..
NUMBER 5
COLOR RED
..
..
END APPLE
..
!
START ORANGE
..
..
NUMBER 10
COLOR GREEN
..
END ORANGE
..
..
!
START APPLE
..
..
NUMBER 10
COLOR GREEN
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE
 
S

Steve Roscio

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Is this along the lines of what you're looking for?

$a = ... #your big blob of fruit text
$fruit = "apple";
$color = "red";
@clauses = $a =~ m{start \s+ $fruit
((?: .*? ) \b color \s+ $color (?: .*? )) \b
end \s $fruit
}ixg;

In the above @clauses will contain all inner clauses that match the
fruit and color given. I know this doesn't address the substitution
part of your problem, but lets take it a step at a time... (I'm not
sure I understand your problem fully).

- - Steve

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

!
START APPLE
.
.
NUMBER 5
COLOR RED
.
.
END APPLE
.
!
START ORANGE
.
.
NUMBER 10
COLOR GREEN
.
END ORANGE
.
.
!
START APPLE
.
.
NUMBER 10
COLOR GREEN
.
.

END APPLE
!
START APPLE
.
.
NUMBER 2
COLOR RED
.
.
END APPLE

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkl3mf0ACgkQb3tX8F/f+Wy3YQCffjXyYzfO/+gUAJ5CPao7RlVZ
sLgAn1ef/R6D+fQgJyw7tYU8gdM6m1WQ
=3VLf
-----END PGP SIGNATURE-----
 
A

A. Sinan Unur

(e-mail address removed) wrote in @m12g2000vbp.googlegroups.com:
I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..

I am not sure anything is really wrong with the approaches listed above.
I would be inclined to parse the data into proper Perl data structures
(in this case an array of hashrefs seems reasonable). However, it seems
to me like you are not familiar with the range operator .. so I came up
with the following example using your data:

#!/usr/bin/perl

use strict;
use warnings;

my ($number, $color, $buffer);

while ( <DATA> ) {
print and next unless my $seq = /^START APPLE/ .. /^END APPLE/;

$buffer .= $_;
$number = $1 if /^NUMBER (\d+)/;
$color = $1 if /^COLOR (\w+)/;

if ( $seq =~ /E0$/ ) {

if ( $color eq 'RED') {
my $replacement = $number + 1;
$buffer =~ s/^NUMBER $number/NUMBER $replacement/m;
}

print $buffer;
undef for $number, $color, $buffer;
}
}


__DATA__
!
START APPLE
..
..
NUMBER 5
COLOR RED
..
..
END APPLE
..
!
START ORANGE
..
..
NUMBER 10
COLOR GREEN
..
END ORANGE
..
..
!
START APPLE
..
..
NUMBER 10
COLOR GREEN
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
S

spydox

Is this along the lines of what you're looking for?

 $a = ... #your big blob of fruit text
 $fruit = "apple";
 $color = "red";
 @clauses = $a =~ m{start \s+ $fruit
                    ((?: .*? ) \b color \s+ $color (?: .*? )) \b
                    end \s $fruit
                   }ixg;


In the above @clauses will contain all inner clauses that match the
fruit and color given.  I know this doesn't address the substitution
part of your problem, but lets take it a step at a time... (I'm not
sure I understand your problem fully).

- - Steve

Wow - yes I think you are barking up the RIGHT tree! That does look
appealing. The *inner clause* concept is really the concept I was
after. The subs I can manage I think by extending your regex a bit..

Thank-You both for taking the time to reply, both answers got me
thinking about a new approach.

Cheers,
S
 
S

spydox

(e-mail address removed) wrote in @m12g2000vbp.googlegroups.com:



#!/usr/bin/perl

use strict;
use warnings;

my ($number, $color, $buffer);

while ( <DATA> ) {
    print and next unless my $seq = /^START APPLE/ .. /^END APPLE/;

    $buffer .= $_;
    $number  = $1 if /^NUMBER (\d+)/;
    $color   = $1 if /^COLOR (\w+)/;

    if ( $seq =~ /E0$/ ) {

        if ( $color eq 'RED') {
            my $replacement = $number + 1;
            $buffer =~ s/^NUMBER $number/NUMBER $replacement/m;
        }

        print $buffer;
        undef for $number, $color, $buffer;
    }

}


Sinan

I'm not sure this will work since the COLOR is undef until its too
late? It comes later in the clause?

I really like your construct: undef for $number, $color, $buffer;

is that the same (effecively) as

( $number, $color, $buffer ) = (undef, undef, undef );

?

Also is, as you stated it,

$number, $color, $buffer

a list? If so why doesn't it require parens?

S
 
A

A. Sinan Unur

(e-mail address removed) wrote in @p36g2000prp.googlegroups.com:
....

I'm not sure this will work since the COLOR is undef until its too
late? It comes later in the clause?

It does not matter. The program does not enter the body of the

if ( $seq =~ /E0$/ )

statement until after it has seen all relevant information. Run the
program (in a debugger if necessary) to see how it works.
I really like your construct: undef for $number, $color, $buffer;

is that the same (effecively) as

( $number, $color, $buffer ) = (undef, undef, undef );

Strictly speaking, it is the same as:

undef $number;
undef $color;
undef $buffer;
Also is, as you stated it,

$number, $color, $buffer

a list? If so why doesn't it require parens?

It is standard for loop syntax. No different than:

say for 1, 2, 3;

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
T

Tad J McClellan

Also is, as you stated it,

$number, $color, $buffer

a list?

Yes.


If so why doesn't it require parens?


Because that's what the "Statement Modifiers" section in

perldoc perlsyn

says :)


(also because parens do not make a list!)
 
U

Uri Guttman

s> I really like your construct: undef for $number, $color, $buffer;

you shouldn't like that construct. it is slower and clunkier than
needed. my rule is to avoid explicit use of the undef function as much
as possible. this is for style reasons and undef is easily avoidable

s> is that the same (effecively) as

s> ( $number, $color, $buffer ) = (undef, undef, undef );

why the extra redundant undef's?

( $number, $color, $buffer ) = () ;

that is faster than the for loop and shorter than the explicit undef
list.

but i suspect there isn't a need for undefing like that. whenever i see
a bunch of that sort of code i smell a poor design of the loop or
whatever. designing things so you don't need that is not difficult and
you can use scope, sub return, etc. to handle things without explicit
calls to undef.

uri
 
S

sln

I often find that I need to make a change to a block with logic like:

IN ALL START APPLE..END APPLE blocks, change NUMBER N to NUMBER N+1 if
the color is RED

Using a regex to do this seems impractical to me ( and perhaps it
really is), so in these WORD-ish cases I usually just loop through the
array lines and set flags like $inAPPLE++... Its also tricky because
often as in this case, I have to *look ahead* to check the color.
Which makes looping not so nifty either.

What I'd really prefer is a regex that sensed when I was in a START
APPLE..END APPLE clause (of course ONLY in one clause- not the very
FIRST START APPLE to the very LAST END APPLE), then

Another approach I sometimes take is to split() the blocks out, then
map the regex into the split array, then recombine. That's actually
usually pretty effective, but sort of a pain because I have to
scalarize the array, split it, map it, recombine it, and then re-array
it. Geesh!

So anyhow- I'm just wondering if this is some sort of perfunc or regex
etc I can use to assist? I feel like I'm barking up the wrong tree..
Thanks in advance gurus..
[snip data]

I guess this is one way to do it if you don't mind zero-width assertions.

sln

-------------------------------
use strict;
use warnings;

my $txt = join '', <DATA>;

my $delim1 = "START APPLE";
my $delim2 = "END APPLE";
my $color = "COLOR RED";
my $number = "NUMBER";
my $cntr = 0;

sub getChange
{
return ($_[0] =~ /\s$color\s/i and $_[0] =~ s/(\s$number\s)(\d+)(\s)/$1.($2+1).$3/ie);
}

$txt =~ s/((?:\s|^)$delim1(?=\s))((?:(?!\s$delim1\s).)*?\s)($delim2(?:\s|$))/
my $chunk = "$1$2$3";
if (getChange($chunk)) { ++$cntr; print $chunk."\n"; }
$chunk;
/isge;

if ($cntr > 0)
{
print "\nChanged $cntr block(s)\n";
} else {
print "\nDid not change any blocks, count = $cntr\n";
}
print "\n----------------------\nTXT:\n'$txt'\n";



__DATA__


!
START APPLE
..
..
NUMBER 5
COLOR RED
..
..
END APPLE
..
!
START ORANGE
..
..
NUMBER 10
COLOR GREEN
..
END ORANGE
..
..
!
START APPLE
..
..
NUMBER 10
COLOR GREEN
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE
 
A

A. Sinan Unur

s> I really like your construct: undef for $number, $color, $buffer;

you shouldn't like that construct. it is slower and clunkier than
....

but i suspect there isn't a need for undefing like that.

No there isn't. I was being careless. Thanks for the correction.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
S

sln

(e-mail address removed) wrote in @m12g2000vbp.googlegroups.com:
....
[snip OP statement]
I am not sure anything is really wrong with the approaches listed above.
I would be inclined to parse the data into proper Perl data structures
(in this case an array of hashrefs seems reasonable). However, it seems
to me like you are not familiar with the range operator .. so I came up
with the following example using your data:

#!/usr/bin/perl

use strict;
use warnings;

my ($number, $color, $buffer);

while ( <DATA> ) {
print and next unless my $seq = /^START APPLE/ .. /^END APPLE/;

$buffer .= $_;
$number = $1 if /^NUMBER (\d+)/;
$color = $1 if /^COLOR (\w+)/;

if ( $seq =~ /E0$/ ) {

if ( $color eq 'RED') {
^^
if ( $color && $number && $color eq 'RED') {
my $replacement = $number + 1;
$buffer =~ s/^NUMBER $number/NUMBER $replacement/m;
}

print $buffer;
undef for $number, $color, $buffer;
^^^^^ funny this doesen't undefine $_ in my Perl

# $_ = undef for $number, $color, $buffer;
# or
($number, $color, $buffer) = ();

# Doesen't matter if they are undefined, but they should be set to a cleared state
# here in some way, $buffer for sure. Un-matched NUMBER or COLOR within a sequence block
# will either kick out with an error or will use matches from a previous block.
# I guess this isn't nit-picking, just error prevention.

sln

# test data:

__DATA__
!
START APPLE
..
..
NUMBER 5a
COLOR RED
..
..
END APPLE
..
!
START APPLE
..
..
NUMBER 10
COLOR
..
..

END APPLE
!
START APPLE
..
..
NUMBER 2
COLOR RED
..
..
END APPLE
 
E

Eric Pozharski

s> I really like your construct: undef for $number, $color, $buffer;

you shouldn't like that construct. it is slower and clunkier than
needed. my rule is to avoid explicit use of the undef function as much
as possible. this is for style reasons and undef is easily avoidable

s> is that the same (effecively) as

s> ( $number, $color, $buffer ) = (undef, undef, undef );

why the extra redundant undef's?

( $number, $color, $buffer ) = () ;

that is faster than the for loop and shorter than the explicit undef
list.

perl -wle '
use Benchmark qw|cmpthese timethese|;
my($x, $y, $z);
my $t = timethese -10, {
for => sub { undef for $x, $y, $z },
list => sub { ($x, $y, $z) = (undef, undef, undef) },
seq => sub { undef $x; undef $y; undef $z },
empty => sub { ($x, $y, $z) = () }, };
cmpthese $t;
'
Benchmark:
running
empty, for, list, seq
for at least 10 CPU seconds
....

empty: 10 wallclock secs (10.01 usr + 0.01 sys = 10.02 CPU) @
1152581.04/s (n=11548862)

for: 13 wallclock secs (10.49 usr + 0.05 sys = 10.54 CPU) @
291213.00/s (n=3069385)

list: 13 wallclock secs (10.71 usr + 0.02 sys = 10.73 CPU) @
959226.75/s (n=10292503)

seq: 12 wallclock secs ( 9.95 usr + 0.05 sys = 10.00 CPU) @
1751871.70/s (n=17518717)

Rate for list empty seq
for 291213/s -- -70% -75% -83%
list 959227/s 229% -- -17% -45%
empty 1152581/s 296% 20% -- -34%
seq 1751872/s 502% 83% 52% --

And look at that 0.05 sys for I<seq> -- it does something. Just
rechecked -- results aren't that stable

Rate for list empty seq
for 283340/s -- -71% -76% -79%
list 963256/s 240% -- -18% -30%
empty 1167856/s 312% 21% -- -15%
seq 1371866/s 384% 42% 17% --
but i suspect there isn't a need for undefing like that. whenever i see
a bunch of that sort of code i smell a poor design of the loop or
whatever. designing things so you don't need that is not difficult and
you can use scope, sub return, etc. to handle things without explicit
calls to undef.

Though I should admit, that I would stay with Uri's idea (whatever name
of that idea is).
 
U

Uri Guttman

EP> Though I should admit, that I would stay with Uri's idea (whatever name
EP> of that idea is).

call it the "don't use undef unless you must" rule. :)

similar to the string eval and symrefs rules i have posted many times -
don't use them unless you know when not to use them!

uri
 
A

A. Sinan Unur

EP> Though I should admit, that I would stay with Uri's idea
(whatever name EP> of that idea is).

call it the "don't use undef unless you must" rule. :)

Would you believe that post where my main purpose was to come up with a
quick example of using the range operator to solve the OP's problem was
the first time I undef'ed a bunch of variables like that?
similar to the string eval and symrefs rules i have posted many times
- don't use them unless you know when not to use them!

Good, now my name is mentioned along with string eval and symrefs. I'll
never recover from that ;-)

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
T

Tad J McClellan

Uri Guttman said:
EP> Though I should admit, that I would stay with Uri's idea (whatever name
EP> of that idea is).

call it the "don't use undef unless you must" rule. :)


It needs parens with it to distinguish the undef operator (avoid) from
the undef value (ok).


Call it the "undef() is unwanted" rule.

"undef() is unwarranted"?

"undef() is ungood"?

"undef() should be underused"?


:)
 
E

Eric Pozharski

My point was aimed to OP (though useles, I think).
Would you believe that post where my main purpose was to come up with a
quick example of using the range operator to solve the OP's problem was
the first time I undef'ed a bunch of variables like that?


Good, now my name is mentioned along with string eval and symrefs. I'll
never recover from that ;-)

Sorry, it was by accident :(
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,996
Messages
2,570,238
Members
46,826
Latest member
robinsontor

Latest Threads

Top