Perl HTML parser

kevin kitenik · Nov 11, 2010

Hi everybody,

I have a piece of an HTML file that contains special if-then-else statements,
like these:
<if condition="$vboptions['hometitle']"><a href="$vboptions[homeurl]">$vboptions[hometitle]</a> -</if>
<if condition="$vboptions[privacyurl]"><a href="$vboptions[privacyurl]"><else><tr><td>test here
</if>

These if statements can be nested:
if ... then ...
else
if .. then ...
fi
fi

The problem is that I want a way to transform these statements into the
((cond1)) ? ((exec1)) : ((exec2)) style.

How can I do this?
I tried CPAN without any success!

use Parse::RecDescent;
my @s = ( q{<if condition="$vboptions['hometitle']"> <a href="$vboptions[homeurl]">$vboptions[hometitle]</a> - </if>} );

&pars;
sub pars {
    my $parser = new Parse::RecDescent( q{
        startrule: S
        S: if ifC '">' then S else S fi {$return="(($item[2]) ? (\"$item[5]\") : ($item[7]))";}
         | if ifC '">' then S fi {$return="(($item[2]) ? (\"$item[5]\") : (\"\"))";}
         | html {$return=$item[1];}
        if: '<if condition="'
        fi: '</if>'
        ifC: /[^"]+/
        then: ''
        else: '<else>' | '<else />'
        html: /[\w\d_\$,\[\] ="\/\<\>-]+/
    });
    foreach my $s (@s) {
        print $s . ":\n" . $parser->startrule( $s ) . "\n";
    }
}

Thank you in advance for any suggestions, because I have a headache ;-)

sln · Nov 12, 2010

I'm not surprised you have a headache.
You could see what it's doing if you set $::RD_TRACE = 1;

Let's look at one of your data strings.
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]">
<else><tr><td> test here
</if>}

-----------------------------------
use strict;
use warnings;
use Parse::RecDescent;

$::RD_TRACE = 1;

my @s=(
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]">
<else><tr><td> test here
</if>}
);

pars();

sub pars {
    my $parser = Parse::RecDescent->new( q{

        startrule: S
        S: if ifC '">' then S else S fi {$return="(($item[2]) ? (\"$item[5]\") : ($item[7]))";}
         | if ifC '">' then S fi {$return="(($item[2]) ? (\"$item[5]\") : (\"\"))";}
         | html {$return=$item[1];}

        if: '<if condition="'
        fi: '</if>'
        ifC: /[^"]*/
        then: ''
        else: '<else>' | '<else />'
        html: /[\w\d_\$,\[\] ="\/<>-]+/
    });
    foreach my $s (@s) {
        print "\n", '+' x 30, "\n", $s, ":\n", '-' x 30, "\n", ($parser->startrule( $s )), "\n";
    }
}

__END__
-----------------------------------

The first time through S, it finds
if ifC '">' then
which is
if: $item[1] - '<if condition="' (literal)
ifC: $item[2] - '$vboptions[privacytitle]' =~ /[^"]*/
$item[3] - '">' (literal)
then: $item[4] - '' (literal)

Then it recurses S, it finds
html
which is
$item[5] - '<a href="$vboptions[privacyurl]">' =~ /[\w\d_\$,\[\] ="\/<>-]+/

Back from recursion, it then finds
else
which is
$item[6] - '<else>' (literal)

Then, recurse S again, it finds
html
which is
$item[7] - '<tr><td> test here' =~ /[\w\d_\$,\[\] ="\/<>-]+/

Back from recursion, it then finds
fi
which is
$item[8] - '</if>' (literal)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This code will produce a proper result if and
only if there is a separator between

S (separator) else
and
S (separator) fi

that is NOT in the html production /[\w\d_\$,\[\] ="\/<>-]+/.

This can be a tab or a newline, because neither is in the character
class of that regex.

For example, this:
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]"><else>
<tr><td> test here
</if>}
will fail because
'<a href="$vboptions[privacyurl]"><else>' =~ /[\w\d_\$,\[\] ="\/<>-]+/
will match, taking else: with it

And,
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]">
<else><tr><td> test here </if>}
will fail because
'<tr><td> test here </if>' =~ /[\w\d_\$,\[\] ="\/<>-]+/
will match, taking fi: with it
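
Both failures can be reproduced without the parser by running the html regex on its own. This is only a minimal sketch in core Perl; the variable names are mine and are not part of the grammar.

```perl
use strict;
use warnings;

# The html production's regex, copied from the grammar above.
my $html = qr/[\w\d_\$,\[\] ="\/<>-]+/;

# No separator before <else>: the match runs straight through the tag.
my $no_sep = q{<a href="$vboptions[privacyurl]"><else>};
my ($m1) = $no_sep =~ /^($html)/;
print "no separator:   [$m1]\n";

# A newline before <else>: "\n" is not in the class, so the match stops there.
my $with_sep = qq{<a href="\$vboptions[privacyurl]">\n<else>};
my ($m2) = $with_sep =~ /^($html)/;
print "with separator: [$m2]\n";

# Same effect at the end of the else branch: </if> is swallowed as well.
my $fi_case = q{<tr><td> test here </if>};
my ($m3) = $fi_case =~ /^($html)/;
print "before </if>:   [$m3]\n";
```

The first and third matches contain the whole input string, tag included, so nothing is left for the else: and fi: productions to match.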

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The regular expressions (as you have them there) are independent.
They are not set up to backtrack.
I think that backtracking is available as a more advanced production concept,
however this can't be done with something as trivial as
/[\w\d_\$,\[\] ="\/<>-]+/
Indeed, the whole realm of discrete, character-level parsing is needed for markup.
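
To illustrate what a lower-level approach could look like, here is a small hand-written sketch that does not use Parse::RecDescent at all. It is not the grammar from this thread but a swapped-in technique: split the input on the three tags, then walk the tokens recursively. The sub names tokenize, parse_seq and convert are made up for this example, and it assumes a condition never contains a double quote.

```perl
use strict;
use warnings;

# Split the template into tag tokens and text tokens. The capturing group
# makes split() return the tags as well as the text between them.
sub tokenize {
    my ($src) = @_;
    return grep { length $_ }
        split m{(<if\s+condition="[^"]*">|<else\s*/?>|</if>)}, $src;
}

# Turn a run of tokens into one expression string. Stops (without consuming)
# at <else> or </if>, so the caller decides what comes next.
sub parse_seq {
    my ($tok) = @_;
    my @parts;
    while (@$tok && $tok->[0] !~ m{^(?:<else\s*/?>|</if>)$}) {
        my $t = shift @$tok;
        if ($t =~ m{^<if\s+condition="([^"]*)">$}) {
            my $cond = $1;
            my $then = parse_seq($tok);
            my $else = '""';                 # no <else> means an empty string
            if (@$tok && $tok->[0] =~ m{^<else}) {
                shift @$tok;
                $else = parse_seq($tok);
            }
            die "missing </if>\n" unless @$tok && $tok->[0] eq '</if>';
            shift @$tok;
            push @parts, "(($cond) ? ($then) : ($else))";
        }
        else {
            $t =~ s/(["\\])/\\$1/g;          # escape quotes and backslashes
            push @parts, qq{"$t"};
        }
    }
    return @parts ? join(' . ', @parts) : '""';
}

sub convert {
    my ($src) = @_;
    my @tok = tokenize($src);
    my $out = parse_seq(\@tok);
    die "unexpected $tok[0]\n" if @tok;      # stray <else> or </if>
    return $out;
}

print convert(q{<if condition="$vboptions[privacyurl]">A<else>B</if>}), "\n";
```

Because the tags are recognised before any text is consumed, no separator is needed in front of <else> or </if>, and nested <if> blocks simply recurse. Text and if-blocks that sit side by side in one branch are joined with the . operator.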

If, however, you are in control of creating the input data, just fashion it so
that known delimiters are inserted where necessary. Then you can generate the correct
HTML, or whatever it is you are doing.
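
If the input can be preprocessed, one possible way to insert such delimiters is a substitution that puts a newline in front of every <else> and </if>. This is only a sketch under that assumption: it shows the html regex stopping at the inserted newlines, and it does not run the grammar itself. The variable names are mine.

```perl
use strict;
use warnings;

# The html production's regex, copied from the grammar.
my $html = qr/[\w\d_\$,\[\] ="\/<>-]+/;

# Input with no separators at all in front of <else> and </if>.
my $input = q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]"><else><tr><td> test here </if>};

# Zero-width lookahead: insert "\n" just before each <else>, <else /> or </if>.
(my $delimited = $input) =~ s{(?=<else\b|</if>)}{\n}g;
print "$delimited\n";

# The html regex now stops at the newline in front of each marker.
my ($then) = $delimited =~ /">($html)/;
my ($else) = $delimited =~ /<else>($html)/;
print "then: [$then]\n";
print "else: [$else]\n";
```

A nested <if ...> would need the same treatment, since every character of that tag is also in the html character class.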

-sln
