HTML::Parser not stripping out comments

Jay

I'm trying to get HTML::Parser to strip out the comments using some of the
sample code from the man page. I'm using ignore_elements, but I still
get comments in the dtext output. Am I doing something wrong?

Tia,
Jay

CODE:
use HTML::Parser ();

# Create parser object
$p = HTML::Parser->new( api_version => 3,
                        start_h => [\&start, "tagname, attr"],
                        end_h => [\&end, "tagname"],
                        comment_h => [\&comment, "self,text"],
                        text_h => [\&dtext, "self,text"],
                        marked_sections => 1,
                      );

$p->ignore_elements( qw(script, comment, style) );
$p->strict_comment( [1] );
# Parse directly from file
$p->parse_file("0");


sub start {
    my($self, $tagname, $attr, $attrseq, $origtext) = @_;
    #...
}

sub end {
    my($self, $tagname, $origtext) = @_;
    #...
}

sub text {
    my($self, $origtext, $is_cdata) = @_;
    #...
}

sub comment {
    #my($self, $origtext, $is_cdata) = @_;
    #...
}

sub dtext {
    my($self, $dtext ) = @_;
    $dtext =~ s/\s+/ /g;
    print "DTEXT: $dtext\n";
}

Example of some of the output from parsing some web page:

DTEXT: <!-- /* You may give each page an identifying name, server, and
channel on the next lines. */ var s_pageName="buy"; var s_server="CWEB15";
var s_channel="buy"; var s_pageType=""; var s_prop1="Autoweb Direct to Site";
var s_prop2="Autoweb Direct to Site 10714"; var s_prop3=""; var s_prop4="";
var s_prop5=""; var s_prop6=""; var s_prop7="buy|"; var s_prop8="";
var s_prop9="buy|Autoweb Direct to Site|10714"; var s_prop10="buy|";
var s_prop11="Autoweb Direct to Site|10714|taweb"; var s_prop12="||";
var s_prop13="||||||buy||No";
var s_prop14="Autoweb Direct to Site|10714|taweb|||||buy||No";
var s_prop15="No Article|No Article"; var s_prop16=""; var s_prop17="";
var s_prop18="Autoweb Direct to Site|10714|buy";
var s_prop19="Autoweb Direct to Site|10714||buy";
var s_prop20="buy||||sky|ban|Autoweb Direct to Site"; /* E-commerce Variables */
var s_campaign="10714"; var s_state=""; var s_zip=""; var s_events="";
var s_products=""; var s_purchaseID=""; var s_eVar1="Autoweb Direct to Site";
var s_eVar2="Autoweb Direct to Site 10714"; var s_eVar3="NT-sky-ban";
var s_eVar4=""; var s_eVar5="";
/********* INSERT THE DOMAIN AND PATH TO YOUR CODE BELOW ************/
/********** DO NOT ALTER ANYTHING ELSE BELOW THIS LINE! *************/
var s_code=' '//-->
DTEXT:
DTEXT:
 
Gisle Aas

Jay said:
I'm trying to get HTML::Parser to strip out the comments using some of the
sample code from the man page. I'm using ignore_elements, but I still
get comments in the dtext output. Am I doing something wrong?

Probably :)
CODE:
use HTML::Parser ();

# Create parser object
$p = HTML::Parser->new( api_version => 3,
start_h => [\&start, "tagname, attr"],
end_h => [\&end, "tagname"],
comment_h => [\&comment, "self,text"],
text_h => [\&dtext, "self,text"],
marked_sections => 1,

You really want marked_sections to be enabled?
);

$p->ignore_elements( qw(script, comment, style) );

You need to remove all the "," here. Otherwise they end up as part of
the strings passed to ignore_elements. Also, there is no <comment> tag
in HTML, so there is no comment element either.
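
The effect of the commas can be checked in plain Perl, without HTML::Parser. This is only an illustrative sketch; the variable names are made up for the example. qw() splits on whitespace, so the commas stay glued to the words:

```perl
use strict;
use warnings;

# qw() splits on whitespace only, so commas remain part of each word.
# "no warnings 'qw'" silences the "Possible attempt to separate words
# with commas" warning that perl emits for the first list.
my @with_commas = do { no warnings 'qw'; qw(script, comment, style) };
my @without     = qw(script style);

print join("|", @with_commas), "\n";   # script,|comment,|style
print join("|", @without), "\n";       # script|style
```

With the commas in place, the only usable name in the list is "style", since no element is called "script,". Running the original script with warnings enabled would probably have pointed at this line.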

The extra commas probably explain why you get the JavaScript comment
reported to your &dtext callback. Anything between <script> and
$p->strict_comment( [1] );

I don't think you actually want strict_comment enabled either. A
plain 1 is also a perfectly fine true boolean.
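
Why [1] still counts as true can be shown with a small sketch, assuming the option is simply tested in boolean context (the list of candidate values is made up for the example). An array reference is always true, whatever it contains:

```perl
use strict;
use warnings;

# Any reference is true in boolean context, even a reference to an
# array that holds only 0. Plain 1 and 0 behave as expected.
my @candidates = ( [1], [0], 1, 0 );
my @truth = map { $_ ? "true" : "false" } @candidates;

print join(" ", @truth), "\n";   # true true true false
```

So wrapping the flag in brackets works by accident, and writing [0] would not switch the option off.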
# Parse directly from file
$p->parse_file("0");

That's a strange file name.

Regards,
Gisle
 
Jay

Thanks Gisle,
I will look at this some more with your recommendations and post the
results.
Yes, a strange filename.

Jay
 
Jay

That did the trick, thanks a lot.

Jay
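
For anyone reading later, here is a minimal sketch of what the script might look like with Gisle's suggestions applied. It is not Jay's final code, which was never posted. It assumes HTML::Parser is installed, parses a made-up in-memory sample instead of the original file, and registers only a text handler, so comments have nowhere to be reported. The @text array and the sample HTML are hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @text;   # collected text chunks (hypothetical helper for this sketch)

# Only a text handler is registered. marked_sections and strict_comment
# are left at their defaults, as suggested in the thread.
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ \&dtext, "self,dtext" ],
);

# No commas inside qw(), and no "comment" entry (HTML has no such element).
$p->ignore_elements(qw(script style));

sub dtext {
    my ($self, $dtext) = @_;
    $dtext =~ s/\s+/ /g;                    # collapse runs of whitespace
    push @text, $dtext if $dtext =~ /\S/;   # skip whitespace-only chunks
}

# Made-up sample input standing in for the real page.
my $html = <<'HTML';
<html><head><script><!--
var s_pageName="buy";
//--></script><style>p { color: red }</style></head>
<body><!-- a real HTML comment --><p>Visible text</p></body></html>
HTML

$p->parse($html);
$p->eof;

print "DTEXT: $_\n" for @text;
```

With this input, neither the script body, the style rules, nor the HTML comment should reach the dtext handler; only the paragraph text is printed.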
 
