Helmut Richter
For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.
The task is to replace each of the characters # $ \ by its HTML entity, e.g. &#35; for #,
but not within markup. The following code reads and consumes a variable
$inbuf0 and builds up a variable $inbuf with the result.
Solution 1:
    while ($inbuf0) {
        $inbuf0 =~ /^(?:    # skip initial sequences of
              [^<\&#\$\\]+    # harmless characters
            | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>    # start tags
            | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>    # end tags
            | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));    # entity or character references
            | <!--(?:.|\n)*?-->    # comments
            | <[?](?:.|\n)*?[?]>    # processing instructions, etc.
        )*/x;
        $inbuf .= $&;
        $inbuf0 = $';
        if ($inbuf0) {
            $inbuf .= '&#' . ord($inbuf0) . ';';
            substr ($inbuf0, 0, 1) = '';
            $replaced = 1;
        };
    };
Here the regexp eats up the maximal initial string that need not be processed
(note the * at the end of the regexp), and the loop then processes the first
character of the remainder.
This version sometimes works and sometimes blows up with a segmentation
fault.
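The "process the first character" step relies on ord() looking only at the first character of its argument. A minimal sketch of that step in isolation, using a made-up sample string in a hypothetical variable $rest:

```perl
use strict;
use warnings;

my $rest = '#rest of input';          # sample data, not taken from the real input
my $out  = '&#' . ord($rest) . ';';   # ord() uses only the first character; '#' is 35
substr($rest, 0, 1) = '';             # remove that character from the front
print "$out|$rest\n";                 # prints "&#35;|rest of input"
```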
Another version has * instead of + after the "harmless characters" class. That
one never tries the other alternatives, because the first one always matches
(possibly the empty string); that is, the * at the end of the regexp is not
used in this case.
Yet another version has no quantifier at all after the "harmless characters"
class, thus eating a single harmless character per iteration of the final *.
This should have the same net effect, but it always blows up with a
segmentation fault.
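The difference between + and * on the first alternative can be seen in a stripped-down sketch. It uses a single step without the outer *, a made-up sample string, and a two-alternative pattern that merely stands in for the full one:

```perl
use strict;
use warnings;

my $sample = '<b>text';    # made-up input that starts with a tag

# With +, the first alternative fails at '<', so the tag alternative is tried.
my ($plus) = $sample =~ /^((?:[^<]+|<b>))/;

# With *, the first alternative succeeds with an empty match, and the tag
# alternative is never reached.
my ($star) = $sample =~ /^((?:[^<]*|<b>))/;

printf "plus: '%s', star: '%s'\n", $plus, $star;    # plus: '<b>', star: ''
```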
Solution 2:
    while ($inbuf0) {
        if ($inbuf0 =~ /^    # skip initial
              [^<\&#\$\\]+    # harmless characters
            | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>    # start tags
            | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>    # end tags
            | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));    # entity or character references
            | <!--(?:.|\n)*?-->    # comments
            | <[?](?:.|\n)*?[?]>    # processing instructions, etc.
        /x) {
            $inbuf .= $&;
            $inbuf0 = $';
        } else {
            $inbuf .= '&#' . ord($inbuf0) . ';';
            substr ($inbuf0, 0, 1) = '';
            $replaced = 1;
        };
    };
Here the regexp eats up an initial string that need not be processed, typically
not a maximal one (note the absence of * at the end of the regexp); if nothing
has been found, the loop processes the first character of the input.
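A variant of this loop, as a sketch only: it scans the string in place with \G and the /gc modifiers instead of copying $' back into $inbuf0 on every iteration. Whether that copying accounts for the slowdown is a guess that has not been measured. The alternatives are grouped so that the anchor applies to each of them; a bare ^ in front of an ungrouped alternation anchors only the first alternative. The helper name escape_scan is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper; the alternatives are the ones from Solution 2.
sub escape_scan {
    my ($in) = @_;
    my $out = '';
    pos($in) = 0;
    while (pos($in) < length($in)) {
        if ($in =~ /\G(    # continue where the previous match ended
                  [^<\&#\$\\]+    # harmless characters
                | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>    # start tags
                | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>    # end tags
                | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));    # references
                | <!--(?:.|\n)*?-->    # comments
                | <[?](?:.|\n)*?[?]>    # processing instructions, etc.
            )/xgc) {
            $out .= $1;                 # markup or harmless text, copied unchanged
        } else {
            # No alternative matched here: encode one character and step past it.
            $out .= '&#' . ord(substr($in, pos($in), 1)) . ';';
            pos($in) = pos($in) + 1;    # /c kept pos() intact after the failed match
        }
    }
    return $out;
}

print escape_scan('a # b <a href="#x">$</a>'), "\n";
# prints: a &#35; b <a href="#x">&#36;</a>
```

The loop condition compares pos() with the length of the string, so an input that ends in the single character 0 is still processed to the end.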
This version runs considerably slower, by a factor of three, but has so far
not yielded segmentation faults. I am using it now.
I am sure there are lots of other ways to do it. What knowledge would have
saved me the numerous trial-and-error cycles and let me get it right from
the beginning?
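One of those other ways, as an unverified sketch that has not been run on 5.8.8: a single s///ge that, at each position, first tries to match a piece of markup and otherwise a single special character. The names $name, $markup and escape_outside_markup are hypothetical. Like the loops above, it also encodes a stray < or & that does not start any markup. Comments and processing instructions are matched with .*? under /s here, which is a swapped-in form of (?:.|\n)*?.

```perl
use strict;
use warnings;

# Hypothetical names; the patterns follow the ones used in the loops above.
my $name   = qr/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*/;
my $markup = qr/
      <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>    # start tags
    | <\/$name\s*>                                        # end tags
    | \&(?:$name|\#(?:[0-9]+|x[0-9A-Fa-f]+));             # entity or character references
    | <!--.*?-->                                          # comments
    | <[?].*?[?]>                                         # processing instructions, etc.
/xs;

sub escape_outside_markup {
    my ($text) = @_;
    # Markup is put back unchanged; a lone special character becomes &#NNN;.
    $text =~ s/($markup)|([<&#\$\\])/defined $2 ? '&#' . ord($2) . ';' : $1/ge;
    return $text;
}

print escape_outside_markup('a # b <a href="#x">$</a> &amp;'), "\n";
# prints: a &#35; b <a href="#x">&#36;</a> &amp;
```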