jl_post
Hi,
I have a Perl script that processes multi-line input. The problem
is that this input sometimes has newlines stuck in arbitrary places
(such as right in the middle of a valid token). This makes the input
out-of-spec, but I have no control over that, so I want to correct it
if I can. What's more, sometimes a newline breaks a token in two such
that the first half still looks like a valid token while the second
half does not, or vice versa.
I'm trying to modify my Perl script so that it reviews every
newline and decides whether it should be discarded. The logic I want
to use is to throw out every newline UNLESS it is flanked (on both
sides) by valid tokens. I would like to be able to do something like
this:
# Create a regular expression that matches tokens
# like "N50E40", "N50 E40", "N5000 E4000",
# "50N40E", "50N 40E", and "5000N4000E":
my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;
# Remove newlines that are not surrounded by valid tokens:
$input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g; # no token before
$input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g; # no token after
$input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g; # no tokens
The problem is that look-behind assertions (both positive and
negative) only work for fixed-width expressions, according to
"perldoc perlre". Being able to match with a variable-width
look-behind would be so useful to me here that I'm hoping there's a
logical work-around to this limitation.
Is there any way for me to work around this limitation?
Thanks.
-- Jean-Luc
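One possible work-around, offered only as a rough sketch: split the
input on newlines and test each junction with anchored matches, so no
look-behind is needed at all. A match against /$tokenRegExp\z/ on the
text before the newline stands in for the look-behind, and a match
against /\A$tokenRegExp/ on the text after it stands in for the
look-ahead. The helper name drop_stray_newlines and the sample input
are made up for illustration. The sketch also assumes that "the text
before the newline" means the current line after any earlier pieces
have been glued back onto it.

```perl
use strict;
use warnings;

# Token pattern from the post: "N50E40", "N50 E40", "50N40E", "50N 40E", ...
my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;

# Hypothetical helper (not from the post): keep a newline only when a token
# ends right before it and another token starts right after it.
sub drop_stray_newlines {
    my ($input) = @_;
    my @lines = split /\n/, $input, -1;   # -1 keeps trailing empty fields
    return $input unless @lines;          # empty input: nothing to do

    my @out = (shift @lines);
    for my $next (@lines) {
        # \z anchors the token at the end of the text before the newline,
        # \A anchors the token at the start of the text after it.
        if ($out[-1] =~ /$tokenRegExp\z/ && $next =~ /\A$tokenRegExp/) {
            push @out, $next;      # newline is flanked by tokens: keep it
        } else {
            $out[-1] .= $next;     # otherwise glue the two pieces together
        }
    }
    return join "\n", @out;
}

# Made-up sample: "N50 E40" and "S10 W20" are each broken by a stray newline.
print drop_stray_newlines("N5\n0 E40\nS10 W2\n0\n30S 15W"), "\n";
```

With that sample, the two stray newlines are dropped and the two
newlines between complete tokens are kept, giving "N50 E40", "S10 W20"
and "30S 15W" on separate lines. Note that a trailing newline at the
end of the input is also dropped, since no token follows it.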