Variable length lookbehind not implemented

F

fmassion

Hi folks:

My text (sample):

saddle stitcher: <font color="#008080"><b>repl. of 8 saddle stitcher</b></font> <font color="#8000FF">

Goal:
I want to put numbers in square brakets, but only if they do not occur within tags.

My code:

#!/usr/bin/perl -w
open(IN,'sample.txt') || die("Datei kann nicht geöffnet werden!\n");
my $number = '(?<!<.*?)\d+(?!.*?>)';
while(<IN>) {
$_ =~ s/$number/\[$number\]/g;
print "$_\n";
}
close (IN);

Error message:

Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/at D:\Perl\test.pl line 5, <IN> line 1.

I couldn't find an explanation for this error message. Has anyone an idea?
 
C

Charles DeRykus

Hi folks:

My text (sample):

saddle stitcher: <font color="#008080"><b>repl. of 8 saddle stitcher</b></font> <font color="#8000FF">

Goal:
I want to put numbers in square brakets, but only if they do not occur within tags.

My code:

#!/usr/bin/perl -w
open(IN,'sample.txt') || die("Datei kann nicht geöffnet werden!\n");
my $number = '(?<!<.*?)\d+(?!.*?>)';
while(<IN>) {
$_ =~ s/$number/\[$number\]/g;
print "$_\n";
}
close (IN);

Error message:

Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/ at D:\Perl\test.pl line 5, <IN> line 1.

I couldn't find an explanation for this error message. Has anyone an idea?

See "negative look-behind" in perlre. The explanation is "works only for
fixed-width look-behind".

A quick, probably fragile, alternative:

my text;
{ undef $/; $text = <IN>;}

while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
my($out, $in) = ($1,$2);
$out =~ s/(\d+)/[$1]/ag;
print $out, $in;
}
 
U

Uri Guttman

CD> Better written: { local $/; $text = <IN>}

even better:

use File::Slurp ;
my $text = read_file( $file ) ;

uri
 
F

fmassion

Thanks to all of you for the explanations.

This code does the trick:

use File::Slurp ;
my $text = read_file( 'testfile.txt' ) ;
while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
my($out, $in) = ($1,$2);
$out =~ s/(\d+)/[$1]/ag;
print $out, $in;
}

It also works with these lines:
my text;
{ undef $/; $text = <IN>;}

This is the result of the test:

saddle stitcher:| </font><font color="#008080"><b>repl. of [2] saddle stitcher</b></font> <font color="#8000FF">Mishandled paper:| </font><font color="#008080"><b>repl. of mishandled paper</b></font><br>Please add [8] staples .... (only numbers outside the tags have been processed.)
Francois
 
F

fmassion

Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"
We need somehow to find an expression whicht prints the remaining characters.
 
R

Rainer Weikusat

Charles DeRykus said:
Better written: { local $/; $text = <IN>}

Adding the reason for that: local $/ creates a new binding for $/
which is dynamically scoped to the enclosing block (it has dynamic
extent and indefinite scope[*]). This implies that $/ reverts to its
former value after the enclosing block has finished executing. Except
in very 'controlled and limited' circumstance, this is preferable to
overwriting whatever the current value happens to be at the moment and
'leaking' this 'local policy descision' to the all code executeing
after the block.

[*] The Lisp-terminology[**] is somewhat lacking here because the
newly established binding is only visible to code which is reachable
via an execution path starting in the block and this will usually only
be a subset of all of the program code (in absence of travesties like
'execute a random function found via the symbol table of a random
package').

[**]

http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html
 
R

Rainer Weikusat

Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"

You could use a proper 'lexer' for HTML.

NB: This is something I just wrote down because I thought it couldn't
be that difficult. It is assumed that numbers which are part of a word
shouldn't be bracketed.

--------------
{
local $/;
$_ = <STDIN>;
}

my $in_tag;

{
unless ($in_tag) {
/\G</gc && do {
++$in_tag;
print('<');
redo;
};

/\G\b(\d+)\b/gc && do {
print("[$1]");
redo;
};

(/\G(\d+)/gc
|| /\G([^\d<]+)/gc) && do {
print($1);
redo;
};
} else {
/\G>/gc && do {
print('>');
--$in_tag;
redo;
};

/\G</gc && do {
print('<');
++$in_tag;
redo;
};

/\G([^<>]+)/gc && do {
print($1);
redo;
};
}
}
 
C

Charles DeRykus

Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"
We need somehow to find an expression whicht prints the remaining characters.



This might be a quick fix.. but again it's probably fragile
in many cases.

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');
$out =~ s/(\d+)/[$1]/ag;
print $out,$in;
}

If unfamiliar with any of the above replacement regex items:

See: perldoc perlre # (?: ) and/or \z
perldoc perlop # \G and/or //

also perlre for the /a modifier
 
R

Rainer Weikusat

Charles DeRykus said:
Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;
[...]

This might be a quick fix.. but again it's probably fragile
in many cases.

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');
$out =~ s/(\d+)/[$1]/ag;
print $out,$in;
}

It will also replace numbers in words (which may or may not be
desired). Also, according to a quick test, using

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

works, too.
 
R

Rainer Weikusat

[...]
while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');

Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.
 
C

Charles DeRykus

[...]
while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');

Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.

That's much better. (But, that's why I was careful to use the weasel
words "quick" and "fragile" when responding :)

And since the html's pedigree is unknown, an un-entified "<" causes
problems for both:

just a single un-entified < and any no. 1,2,... to \z vanish


You could add /c and take care of even that I think but, at some point
if you want another great leap, a parser is the way to go.
 
C

Charles DeRykus

if you want another great leap, a parser is the way to go.

I'm not sure this is the "great leap" but here's a possible parser approach:

use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file( $filename );

foreach my $tag ($root->look_down(sub{1) ) {
while( my($index,$child) = each $tag->content_array_ref ) {
unless ( ref($child) eq "HTML::Element" ) {
$child =~ s/(\d+)/[$1]/ag; # 1replaces no's in words
$tag->splice_content( $index,1,$child );
}
}
}
print $root->as_HTML();
 
F

fmassion

Also, according to a quick test, using
while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
works, too.

Yes it works, but unfortunately I get an error message about "uninitialized value $in"

My test strings (it's bullshit, just to test the expression). In practise I am using chunks of HTML/XML files, i.e. text which cannot be parsed because not all the required tags are in the text.

Test sentences:
2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box
Please 345 set Saddle stitcher 2-Side <S 65 R> slitting 1008 scrap box5
2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box

Result with "while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) { "

[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
[2]-side slitting [64] scrap box is full <S 64R>

Result with while "( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {"

[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box

This is line 18: print $out, $in;

Thus all sentences have been processed as they should have, but there are 2 times an uninitialized value "$in".
 
R

Rainer Weikusat

Also, according to a quick test, using

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
works, too.

Yes it works, but unfortunately I get an error message about
"uninitialized value $in"

The easiest way to deal with spurious warnings is "don't enable them"
:->. perl does automatic type conversions whenever necessary but some
people are STRONGLY (!!!!) convinced that programmer convenience is a
surefire way to achieve disaster (why these people dabble in perl
instead of 'languages designed to be obnoxious', ie, C++ or Java,
escapes me ...).

Apart from that, there are various more-or-less ugly workarounds.
The

my ($out, $in) = ($1 // '', $2 // '')

would be one.

Some others

------
while ( $text =~ /\G ([^<]+)|(<.*?>) /sgx ) {
if ($1) {
my $out = $1;
$out =~ s/(\d+)/[$1]/g;
print $out;
} else {
print $2;
}
}
------

This matches either a 'free text' sequence or a complete tag and
performs the substitution when the 'free text' match was successful.

------
while ( $text =~ /\G ([^<]+|<.*?>) /sgx ) {
my $out = $1;
$out =~ s/(\d+)/[$1]/g if $out !~ /^</;
print $out;
}
-----

This is essentially the same except that the matched text always ends
up in $1 so the content of that needs to be examined in order to
determine which it was.

-----
for ($text) {
/\G([^<]+)/gc && do {
my $out = $1;
$out =~ s/(\d+)/[$1]/g;
print $out;
redo;
};

/\G(<.*?>)/g && do {
print $1;
redo;
};
}
----

This use for to alias text to $_. It then checks if either a 'free
text' sequence or a complete tag can be found at the current match
position and performs the correct action for each, followed by a
'redo' in order to restart the loop. If neither pattern matched, end
of the input has obviously been reached and the loop (sort of)
terminates.

NB: The first match needs an additional /c to avoid resetting the
match position if it fails. The second one doesn't because if it
fails, the loop will terminate, anyway.
 
R

Rainer Weikusat

Charles DeRykus said:
[...]
while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');

Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.

That's much better. (But, that's why I was careful to use the weasel
words "quick" and "fragile" when responding :)

And since the html's pedigree is unknown, an un-entified "<" causes
problems for both:

just a single un-entified < and any no. 1,2,... to \z vanish

Filters are ill-suited for syntax checking because they will produce
garbage output in case of errors.

BTW: Why <.*?> and not <.*>?
 
R

Rainer Weikusat

[...]

-----
for ($text) {
/\G([^<]+)/gc && do {
my $out = $1;
$out =~ s/(\d+)/[$1]/g;
print $out;
redo;
};

/\G(<.*?>)/g && do {

This should be

/\G(<.*?>)/gs

so that tags formatted like this

<
hippocampus
are also matched.
 
F

fmassion

Thanks Rainer,

I'll just mention here what worked and what didn't work:

This didn't work as expected. The last bit of text has not been processed:
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,153
Members
46,699
Latest member
AnneRosen

Latest Threads

Top