HTML::Parser - duplicated text in <h2> .. </h2> ?

Geoff Cox

Hello

I am trying to get back into using HTML::Parser after a year's gap!

In one html file I am getting a few instances of

<h2>headingheading</h2>

ie the heading text is repeated.

Other <h2> headings in this same file are OK.

Any idea where I should start to look for the solution?!

Thanks

Geoff
 
J. Gleixner

Geoff said:
Hello

I am trying to get back into using HTML::Parser after a year's gap!

In one html file I am getting a few instances of

<h2>headingheading</h2>

ie the heading text is repeated.

Other <h2> headings in this same file are OK.

Any idea where I should start to look for the solution?!

ahhhh.. Looking at your code would be the place to start??? :)

Sounds like it might be something to do with the first occurrence of the
h2 element, but who knows... If adding some print statements and running
under "perl -d" doesn't help you figure it out on your own, post a short
sample of your code that exhibits the issue.
 
Geoff Cox

ahhhh.. Looking at your code would be the place to start??? :)

!! was hoping not to have to bother you with my wonderful code!

Sounds like it might be something to do with the first occurrence of the
h2 element, but who knows... If using some print's and "perl -d" doesn't
help you figure it out on your own, post a short sample of your code
that exhibits the issue.

Thanks for the ideas - it just seems odd that 99% of the h2 elements
are handled OK ... so could the problem be in the html file rather
than in the Perl code? ... but cannot see why ...

Cheers

Geoff
 
Geoff Cox

ahhhh.. Looking at your code would be the place to start??? :)

As I say in my other email, it seems odd that 99% of the h2 elements
are handled correctly ..? The part with the h2 element is as follows
... anything wrong there?

Geoff

package MyParser;
use base qw(HTML::Parser);
use strict;
use diagnostics;

my ($in_h2, $in_p, $in_ul, $in_li, $fh);

sub register_fh { $fh = $_[1]; }

sub reset { ($in_h2, $in_p, $in_ul, $in_li) = (0, 0, 0, 0) }

sub start {
    my ($p, $t, $a, undef, $txt) = @_;

    if ($t eq 'p') {
        $in_p = 1;
        print $fh '<p>';
        return;
    }

    if ($t eq 'h2') {
        $in_h2 = 1;
        print $fh '<h2>';
        return;
    }

    if ($t eq 'li') {
        $in_li = 1;
        print $fh '<li>';
        return;
    }

    if ($t eq 'option') {
        main::choice( $a->{ value } );
        return;
    }

}

sub end {
    my ($p, $t, $txt) = @_;

    if ($t eq 'p') {
        $in_p = 0;
        print $fh "</p>\n";
        return;
    }

    if ($t eq 'h2') {
        $in_h2 = 0;
        print $fh "</h2>\n";
        return;
    }

    if ($t eq 'li') {
        $in_li = 0;
        print $fh "</li>\n";
        return;
    }

}

sub text {
    my ($p, $txt) = @_;
    print $fh $txt if ($in_p);
    print $fh $txt if ($in_h2);
    print $fh $txt if ($in_li);
}

package main;

etc etc
 
J. Gleixner

Geoff said:
As I say in my other email, it seems odd that 99% of the h2 elements
are handled correctly ..? The part with the h2 element is as follows
... anything wrong there?

I'm not sure what you're passing to those methods. However, simplifying
the argspecs to just "tagname" for &start and &end, and "dtext" for &text,
reproduces what you're seeing whenever an <h2> occurs within a <p> and/or
an <li> element.

Why?? Because $in_p is 1 and $in_h2 is 1, so $txt is printed twice. The
text within 'h2', "abcd", will be displayed 3 times if the HTML contained
<p><ul><li><h2>abcd</h2></li></ul></p>, since $in_li is then set as well.

This will help you see where/why it's occurring in your code:

sub text {
    my ($txt) = @_;

    # One labelled line per active flag, so repeated text is easy to spot.
    print $fh "in_p txt=$txt\n" if $in_p;
    print $fh "in_h2 txt=$txt\n" if $in_h2;
    print $fh "in_li txt=$txt\n" if $in_li;
}
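
To see the effect in isolation, here is a minimal sketch that replays the handler calls by hand instead of running HTML::Parser. The %in hash and the $out buffer are made-up stand-ins for the $in_* variables and $fh, and the event order for <p><h2>abcd</h2></p> is assumed rather than captured from a real parse.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $in_p, $in_h2, $in_li and the output filehandle.
my %in  = (p => 0, h2 => 0, li => 0);
my $out = '';

# Same logic as the posted text handler: one append per active flag.
sub text {
    my ($txt) = @_;
    $out .= $txt if $in{p};
    $out .= $txt if $in{h2};
    $out .= $txt if $in{li};
}

# Hand-written event stream for <p><h2>abcd</h2></p> (an assumption,
# no parser is involved here).
$in{p}  = 1;     # <p>
$in{h2} = 1;     # <h2>
text('abcd');    # text inside the heading
$in{h2} = 0;     # </h2>
$in{p}  = 0;     # </p>

print "$out\n";  # prints "abcdabcd"
```

Two flags are set when the text arrives, so the text is appended twice.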
 
Geoff Cox

I'm not sure what you're passing to those methods, however simplifying
those to just "tagname", for &start and &end, and "dtext" for &text it
shows what you're seeing if an <h2> occurs within a <p> and/or an <li>
element.

Why?? Because $in_p is 1 and $in_h2 is 1, so $txt is printed twice. The
text within 'h2', "abcd", will be displayed 3 times if the HTML contained
<p><ul><li><h2>abcd</h2></li></ul></p>, since $in_li is then set as well.

Thanks!

I checked the HTML and found a <p>-related error which, when
corrected, allowed the h2 elements to work correctly.

I had

<p align="center"><img src="assets/image.jpg"</p>

which should have had the ">" in front of the </p>

ie <p align="center"><img src="assets/image.jpg"></p>

Cheers

Geoff
 
J. Gleixner

Geoff said:
Thanks!

I checked the HTML and found a <p> associated error which when
corrected allowed the h2 elements to work correctly.

Correcting your text method would be a better long-term choice, since
<p><h2>foo</h2></p> is valid HTML, and will occur, not to mention
<p><ul><li>foo</li></ul></p> or many other nested element combinations.
(hint: Do one print, instead of 3, when any of your variables are 1
and there is text to print.)
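
One way to act on that hint, as a rough sketch rather than a drop-in replacement for MyParser: keep the flags, but test them together so the text is emitted at most once. The handler names on_start, on_end and on_text, the %in hash and the $out buffer are invented for this example. The version 3 handler setup with "tagname" and "dtext" follows the simplification mentioned earlier in the thread, and it assumes HTML::Parser is installed.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical names: %in holds one flag per tag of interest,
# $out collects what the original code printed to $fh.
my %in  = (p => 0, h2 => 0, li => 0);
my $out = '';

sub on_start {
    my ($tag) = @_;
    return unless exists $in{$tag};
    $in{$tag} = 1;
    $out .= "<$tag>";
}

sub on_end {
    my ($tag) = @_;
    return unless exists $in{$tag};
    $in{$tag} = 0;
    $out .= "</$tag>\n";
}

sub on_text {
    my ($txt) = @_;
    # Single append, however many of the flags are set.
    $out .= $txt if $in{p} || $in{h2} || $in{li};
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'tagname' ],
    end_h       => [ \&on_end,   'tagname' ],
    text_h      => [ \&on_text,  'dtext'   ],
);

$parser->parse('<p><h2>abcd</h2></p>');
$parser->eof;

print $out;   # "<p><h2>abcd</h2>\n</p>\n", with abcd appearing once
```

Simple 0/1 flags are kept here to stay close to the posted code. They still lose track when the same element is nested inside itself, so a per-tag counter may be a better fit for messier input.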
 
John W. Kennedy

J. Gleixner said:
Correcting your text method would be a better long-term choice, since
<p><h2>foo</h2></p> is valid HTML,

Actually, no, it isn't. Not in any version.
---
John W. Kennedy
"The bright critics assembled in this volume will doubtless show, in
their sophisticated and ingenious new ways, that, just as /Pooh/ is
suffused with humanism, our humanism itself, at this late date, has
become full of /Pooh./"
-- Frederick Crews. "Postmodern Pooh", Preface
 
