Requesting regular expression help

David G

I'm not any kind of Perl expert, but have still managed to inherit a
number of Perl scripts in a project. The original author is long
gone, and we don't have any Perl expertise in house.

The problem is that one of the scripts reads in a text file, and
performs a translation operation on strings it finds enclosed in
double quotes. The perl in question is as follows:

{
open(INPUT, "$infile") or die "Can't open $infile: $!\n";
local $/ = undef;
$inputText = <INPUT>;
close(INPUT);
}

$inputText =~ s/"([^"]*)"/Translate(*TRANS, $1, "Generic String")/ieg;

OpenMakeDir(*OUTPUT, "$outfile") or die "Can't write $outfile \n";
print OUTPUT $inputText;
close(OUTPUT);

As far as I can tell, the entire operation is done in one go by the
line that starts "$inputText =~ ..."

The problem is that we need to have the operation not be performed if
the quoted string is itself enclosed in double angle brackets.

"This string is translated"
<<..."This string isn't"...>>

shows what is needed. If it makes any difference, angle bracket
enclosed sequences will never cross a line boundary.

For what it's worth I got the job because I'd mentioned that I do know
awk reasonably well. Ironically, this would be trivial for me to
craft in awk, but that doesn't help here.

So, is there any way to get the behavior we want? If it means
processing each line of the file in isolation, I'm 100% on board with
that, I just need to find a working solution.

TIA
David G.
 
Steve C

David said:
The problem is that we need to have the operation not be performed if
the quoted string is itself enclosed in double angle brackets.

"This string is translated"
<<..."This string isn't"...>>

shows what is needed. If it makes any difference, angle bracket
enclosed sequences will never cross a line boundary.

Nesting is not something that you can do in the general case in a regexp.
You could change quotes within angle brackets to something else that doesn't
match and also doesn't appear in your text, then do your existing regexp,
then change them back.
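
A minimal sketch of that hide/translate/restore idea, under some loud assumptions: translate() is a hypothetical stand-in for the script's own Translate() routine, the quotes are kept around the result only to make the output readable, and the byte \x01 is assumed never to occur in the input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the script's Translate() routine.
sub translate { my ($s) = @_; return uc $s }

my $text = qq{"hello" <<keep "this" as is>> "world"\n};

# 1. Hide the quotes inside <<...>> behind a placeholder byte that is
#    assumed never to appear in the input (\x01 here).
$text =~ s{(<<.*?>>)}{ (my $t = $1) =~ tr/"/\x01/; $t }ge;

# 2. Run the existing substitution; the hidden quotes no longer match.
$text =~ s/"([^"]*)"/'"' . translate($1) . '"'/ge;

# 3. Put the hidden quotes back.
$text =~ tr/\x01/"/;

print $text;    # "HELLO" <<keep "this" as is>> "WORLD"
```

Since . does not match a newline by default, step 1 relies on the bracketed sequences never crossing a line boundary, as David said they don't.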
 
Jim Gibson

David G said:
The problem is that we need to have the operation not be performed if
the quoted string is itself enclosed in double angle brackets.

"This string is translated"
<<..."This string isn't"...>>

See the advice given by 'perldoc -q balanced' "Can I use Perl regular
expressions to match balanced text?" This has changed in Perl 5.10.

Consider using the Text::Balanced module.
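
For a rough idea of what Perl 5.10 made possible, here is a small sketch using a recursive pattern. The sample line is made up, and it matches balanced single < > pairs in general rather than David's exact << >> convention, so treat it as an illustration only.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and possessive quantifiers need 5.10

my $line = 'a <<one <two> three>> b <x> c';

# Group 1: an opening '<', then any mix of non-bracket runs or nested
# bracket groups (recursing into group 1 via (?1)), then a closing '>'.
my @groups = $line =~ /(<(?:[^<>]++|(?1))*>)/g;

print "$_\n" for @groups;
# <<one <two> three>>
# <x>
```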
 
Uri Guttman

BM> or actually I'd just use the File::Slurp module from CPAN:

BM> use File::Slurp qw/read_file/;

no need to explicitly import read_file. it and write_file are exported
by default.

uri
 
Uri Guttman

TM> He isn't explicitly importing read_file...



TM> ... he is suppressing the importing of write_file. :)

yeah, i know that. but that isn't so harmful to default to the
imports. not that you would create a conflicting write_file sub too often.

TM> I nearly always explicitly name only the funcs I plan to use like that.

depends on the module and how it sets up its export lists and such.

uri
 
sln

Quoth Tad McClellan said:
Match *either* a double-brackety string or a double-quoted
string, and if it is double-brackety just put it back in unchanged:

$inputText =~ s{
(<<.+?>>) # double brackety
|
"([^"]*)" # double quotey
}
{$1 ? # IF there's anything in $1

It's worth pointing out that in general you need 'defined $1' here. It's
safe in this case, since a string containing '<<>>' cannot be false, but
^^^^^^^^^^
This would seem to be understated since ($1 being a string) only the
lone character '0' and undef will be false for condition if( $1 ){};

But, defined $1 is the intention and it's easier to read.
This condition is a trap though, one would want to block
against a while($var = '0') condition since '0' might be
valid as text data, not to be discarded (depending on usage).

Kind of odd, only if the scalar is marked undefined
or is the code for '0' is it false. Not false for '00'.

print "'00' yes \n" if ('00');
print "'0' no \n" if (not '0');
print "undef no \n" if (not undef);
print "'a' yes \n" if ('a');

while(not $var = '0') {
print "while (not \$var = '0'){} yes\n";
last;
}

I probably got this all wrong but have been bitten
by this before, so I don't do it.
In this context though, it is still valid.
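
For reference, the complete either/or substitution with an explicit defined test could look something like the sketch below. translate() is a hypothetical stand-in for the script's Translate() routine, and the quotes are kept around the result only so the output is easy to read.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the script's Translate() routine.
sub translate { my ($s) = @_; return uc $s }

my $text = qq{"hello" <<keep "this">> "world"\n};

$text =~ s{
    (<<.+?>>)      # group 1: double-bracketed run, left alone
  |
    "([^"]*)"      # group 2: quoted string, translated
}{
    defined $1 ? $1 : '"' . translate($2) . '"'
}gex;

print $text;    # "HELLO" <<keep "this">> "WORLD"
```

Because the bracketed alternative is tried first at each position, a quoted string inside << >> is swallowed by group 1 before the second alternative ever sees it.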

-sln
 
sln

There is no need to worry about that, as it generates a rather
clear warning message. (and all sensible Perl programmers enable warnings)

:)
So simple is not better.
while($var = $bar)
while( said:
The "Truth and Falsehood" section in

perldoc perlsyn

states rather clearly what is considered false:

The number 0, the strings C<'0'> and C<''>, the empty list C<()>, and
^^^^^^
What a pity the single character '0' is valid data in while(<DATA>) :)
Who cares about C said:
That isn't in the list from the docs quoted above, so it is "true".

I disagree, it's not false, or is that not the case for true?
The World is astounded at such a rare occurrence!

I'm sure this is a false positive.

-sln
 
sln

Quoth David G:

...
Anyway, something like

    s{ (<<.*?>>) | "([^"]*)" | (.) }{
        $1 // $3 // Translate(*TRANS, $1, "GenericString")
    }iegx;

Did you mean to leave the (.) alternative in the solution?

It's not really necessary. And $2, not $1, should be passed to Translate().
Could have been written like:

s{ <<.*?>>\K | "([^"]*)" }{
defined $1 and ''
}iegx;

It doesn't matter though, it's too simple an expression.
Any combination of << "asdf>>" "asdf>>" type mixing and
matching will break it. Remember, the guard is against
<< " .. " >> which I'm not so sure he didn't mean in a
balanced sense in the first place. Either way, the expression
is much more complicated.

-sln
 
Peter J. Holzer

^^^^^^
What a pity the single character '0' is valid data in while(<DATA>) :)

Doesn't matter.

while (<FH>)
is actually a shorthand for
while (defined($_ = <FH>))
so even if the last line of a file is "0" without a trailing newline,
the loop will still be executed.
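
A quick way to convince yourself, sketched with an in-memory filehandle (the sample data is made up):

```perl
use strict;
use warnings;

# In-memory "file" whose last line is "0" with no trailing newline.
my $data = "first\n0";
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";

my @seen;
while (<$fh>) {          # implicitly: while (defined($_ = <$fh>))
    chomp;
    push @seen, $_;
}
close($fh);

print scalar(@seen), " lines read\n";    # 2 lines read
```

The final "0" is still processed, because the implicit test is for definedness, not truth.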

I disagree, it's not false, or is that not the case for true?

What's the difference between "true" and "not false"?

hp
 
Jürgen Exner

Tad McClellan said:
You cannot disagree, because it is not a matter of opinion!

Tad,

you may want to check whom you are replying to. sln disagrees with
everyone and everything.
It is a fact.

Those never got in his way before, why should they now?

Just killfile him, nobody is paying attention to that troll anyway.

jue
 
sln

Doesn't matter.

while (<FH>)
is actually a shorthand for
while (defined($_ = <FH>))
so even if the last line of a file is "0" without a trailing newline,
the loop will still be executed.

Yes, I see. I wish they didn't muck this up like this.
It's not used in any other conditional construct, and
expanding the expression obliterates the shortcut.

while (($bar = <DATA>) && $flag )
produces traditional behavior (no shortcut), with a nice
little unwanted message:
Value of <HANDLE> construct can be "0"; test with defined() at ..

On the other side, no message (no shortcut):
while ( $flag && ($bar = <DATA>))

The point being that one of these will get into its block,
the other won't.
if ($bar=<DATA>) { } # thanks message: Value of <HANDLE> construct can be ...
while ($bar=<DATA>) { } # no message here ..

where DATA contains the single character '0'.

I think there could be a more consistent approach but
this '0' is a conditional pivot between numbers and strings
that might facilitate buggy code otherwise.
Hey look, they made a single unique message to warn of this and
even silently pop in a short cut for you if you don't remember.
Well, sometimes.

-sln
 
sln

The warning is wanted for the reason already given, this loop
might exit earlier than intended (because $flag will not be
evaluated if $bar eq '0').

That's not why the warning is there. The same warning shows up with
if ($bar=<DATA>) # a conditional

which has no $flag and no shorthand. Perl has to make that distinction,
when assigning to a scalar from <> and used in a conditional, that '0'
is a falsehood. It's no different than
$bar=<DATA>; # not a conditional and no message
if ($bar) # same thing

I'm testing for the truth of $bar. As you mentioned, there
is more than one falsehood: undefined, '', '0', empty list, etc.;
everything else is true.

if (defined($bar)) only tests for one truth - is it defined,
but if ($bar) can still be false even though it's defined and
with the <> construct, it can only be '0' that makes it false.

So, it's more confusing than not when Perl goes behind your back
with this shortcut within a conditional and only within a
while loop, not other constructs. And only then if there
is the single term conditional as a result of an assignment,
ie: $bar=<FH> or just <> ($_=<FH>).

The result is that Perl alters the conditional, significantly,
to get around '0' as a natural *false* condition.

I would rather have it all one way or all the other as far as
conditionals are concerned.

Personally, I always write mine as:
while (defined($bar=<FH>)) {}
just in case I want to come back later and add a term to the conditional:
while ($flag && defined($bar=<FH>)) {}


Instead, the distinction could be confusing:

while( $bar=<FH> )
{
if (!$bar) {
print "why am I still in this loop\n";
if (defined $bar) {
print "because i'm defined?\n";
if ($bar eq '0') {
print "'$bar' was there any doubt?\n";
}
}
}
...
last if !($bar=<FH>);
...
$bar = <FH>;
last if !$bar;
...
}
# or
while( $flag && ($bar=<FH>) )
{}

The docs don't really say why the shorthand is dropped if
using multiple terms in the conditional. I'm not so sure why
it is done this way, the shorthand, the rules, what the rules are, etc.
Maybe its to get people guessing and to use Deparse.

-sln
 
