all the text (including tags) between <body> .. </body>

tarakparekh

Hello,

I am not well-versed in Perl, so I would like to ask for
suggestions/help.

My goal is to merge 2 HTML files - particularly, to get everything
between <body> C1 </body> in file1.html and <body> C2 </body>
in file2.html to create file3.html that has:
<body> C1 C2 </body>

C1 and C2 can contain anything that can fall within the "body" tags.

I have taken a look at HTML::Parser, as well as HTML::TokeParser,
but on initial tryouts was unable to get the tags themselves.

Some postings indicated regular expressions are not good for HTML
parsing, but what I am doing in terms of merging is pretty dumb.

Would appreciate any help.
thanks,
tarak
 
Simon Taylor

Hello Tarak,

This may help throw some light on using HTML::Parser:

http://www.perlmeme.org/tutorials/html_parser.html

And in general, google for "using HTML::Parser".

Regards,

Simon Taylor
 
Paul Lalli

> I am not well-versed in Perl, so would like to request for
> suggestions/help.
>
> My goal is to merge 2 HTML files - particularly, to get everything
> between <body> C1 </body> in file1.html and <body> C2 </body>
> in file2.html to create file3.html that has:
> <body> C1 C2 </body>
>
> C1, C2 can contain anything that can fall within the "body" tags.
>
> I have taken a look at HTML::Parser, as well as HTML::TokeParser,
> but on initial tryouts was unable to get the tags themselves.

So what were those initial tryouts? And what were the results? No one
can help you fix your program if you don't show us your program.

Please read the posting guidelines for this group. Then please post a
short-but-complete script that demonstrates the errors you're having.
Then we can help you correct those errors.

> Some postings indicated reg. expressions are not good for HTML
> parsing,

Correct.

> but what I am doing in terms of merging is pretty dumb.

I have no idea what 'dumb' means in this context.

Paul Lalli
 
tarakparekh

Paul,

Sorry for not posting the script earlier. status.html is a small HTML
file containing some links to pictures, and some to other HTML
documents.

#!/usr/pkg/bin/perl

package my_parser;
use base 'HTML::Parser';

# Package globals: are we inside <body>, and the text collected so far
$in_body = 0;
$body = "";

# Called for every start tag
sub start {
    my ($self, $tag) = @_;

    if ($tag eq 'body') {
        $in_body = 1;
    }
}

# Called for every end tag
sub end {
    my ($self, $tag) = @_;

    if ($tag eq 'body') {
        $in_body = 0;
    }
}

# Called for plain text only, so tags never reach $body
sub text {
    my ($self, $text) = @_;

    if ($in_body) {
        $body .= $text;
    }
}

my $p = my_parser->new();
$p->parse_file('status.html');

print "BODY=$body\n";


--- Results:
BODY=Project: P1Status: Owner: owner1 Issues/Comments:
Issue 1
----

I missed all the links and image tags, as expected, but don't know how
to retain them.

What I meant by "dumb" was that I wanted nothing but all the text
between the <body> .. </body> tags, with no other processing.

thanks,
tarak
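
Editorial sketch of one way to keep the tags with the subclass approach above. In the version 2 style API, the start and end callbacks also receive the original source text of the tag, so it can be appended alongside the plain text. The class name BodyKeeper and the sample HTML string are made up for illustration, and comments or declarations inside the body would still be dropped by this minimal version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package BodyKeeper;    # hypothetical class name
use base 'HTML::Parser';

my $in_body = 0;
my $body    = '';

# start() receives the tag name, attributes, attribute order,
# and the original text of the tag as it appeared in the source.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag eq 'body') { $in_body = 1; return; }
    $body .= $origtext if $in_body;
}

# end() receives the tag name and the original text of the end tag.
sub end {
    my ($self, $tag, $origtext) = @_;
    if ($tag eq 'body') { $in_body = 0; return; }
    $body .= $origtext if $in_body;
}

# text() receives the plain text between tags.
sub text {
    my ($self, $text) = @_;
    $body .= $text if $in_body;
}

package main;

# Sample input (an assumption; the real status.html was not posted).
my $html = '<html><body><p>See <a href="x.html">this</a>'
         . '<img src="p.png"></p></body></html>';

my $p = BodyKeeper->new();
$p->parse($html);
$p->eof;

print "BODY=$body\n";
```

To run it against a file instead, the parse/eof pair could be replaced by a single parse_file('status.html') call, as in the script above.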
 
Scott Bryce

> Sorry for not posting the script earlier. status.html is a small html
> file containing some links to pictures, and some to other html
> documents.

<code snipped>

Since you only want to know what is between the <body> and the </body>
tags, ask the parser to only report on those tags.


#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

my $content;

# Report only <body> and </body>; everything in between is skipped
# and handed to the end handler as 'skipped_text'
my $p = HTML::Parser->new( api_version => 3,
                           start_h     => [\&start],
                           end_h       => [\&end, 'skipped_text'],
                           report_tags => ['body']
                         );

$p->parse_file('status.html') or die "Cannot parse status.html -- $!";

print $content;

sub start
{
    # Nothing needs to happen here
}

sub end
{
    $content = shift;
}


In Win98SE this is putting an extra CRLF at the end of each line. I
don't know if this is a Windows specific thing, or if I am missing
something in the docs that explains why this is happening.
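
Editorial sketch: the same report_tags / skipped_text technique, wrapped in a function and applied to the original goal of merging two bodies. The function name body_content, the two sample documents, and the wrapper markup around the merged body are assumptions for illustration only. Strings are used instead of files so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Return everything between <body> and </body>, tags included.
sub body_content {
    my ($html) = @_;
    my $content = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # The start handler does nothing, but it must exist: reporting
        # <body> discards what was skipped before it (the head section).
        start_h     => [ sub { }, 'tagname' ],
        # At </body>, skipped_text holds everything since <body>.
        end_h       => [ sub { $content .= shift }, 'skipped_text' ],
        report_tags => ['body'],
    );
    $p->parse($html);
    $p->eof;
    return $content;
}

# Stand-ins for file1.html and file2.html (hypothetical content).
my $file1 = '<html><head><title>One</title></head>'
          . '<body><p>C1 <a href="a.html">link</a></p></body></html>';
my $file2 = '<html><body><img src="pic.png" alt="pic"><p>C2</p></body></html>';

my $merged = "<html>\n<body>\n"
           . body_content($file1)
           . body_content($file2)
           . "\n</body>\n</html>\n";

print $merged;
```

For real files, parse_file($path) could replace the parse/eof pair, and $merged could be written to file3.html. On the extra CRLF: one plausible cause, not confirmed in this thread, is that parse_file reads the file in binary mode while STDOUT on Windows is in text mode, so each CRLF that was read gets a second CR on output. Calling binmode(STDOUT) before printing would be one thing to try.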
 
A. Sinan Unur

(e-mail address removed) wrote in @g43g2000cwa.googlegroups.com:
> Paul,
>
> Sorry for not posting the script earlier. status.html is a small html
> file containing
> some links to pictures, and some to other html documents.
>
> #!/usr/pkg/bin/perl
>
> package my_parser;
> use base 'HTML::Parser';

I see Scott Bryce has already posted a solution to your problem, but
here is another way to do it:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(
    url => 'http://www.yahoo.com/'
);

my $in_body;

while( my $token = $p->get_token ) {
    if( $token->is_start_tag('body') ) {
        $in_body = 1;
        next;
    } elsif( $token->is_end_tag('body') ) {
        $in_body = 0;
        next;
    }
    print $token->as_is if $in_body;
}
__END__
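
Editorial sketch: the example above fetches a URL, while the original question is about local documents. Assuming a version of HTML::TokeParser::Simple that accepts named sources, the constructor can also take string => or file =>, and the tokens can be collected into a variable instead of printed. The sample HTML here is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample document (an assumption; substitute file => 'status.html'
# in the constructor to read a local file instead).
my $html = '<html><head><title>t</title></head>'
         . '<body><p>Status: <a href="p1.html">P1</a></p></body></html>';

my $p = HTML::TokeParser::Simple->new( string => $html );

my $in_body = 0;
my $body    = '';

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag('body') ) {
        $in_body = 1;
        next;
    }
    elsif ( $token->is_end_tag('body') ) {
        $in_body = 0;
        next;
    }
    # as_is returns the token exactly as it appeared in the source,
    # so links and image tags are kept.
    $body .= $token->as_is if $in_body;
}

print $body, "\n";
```

Collecting into $body rather than printing makes it straightforward to repeat the loop for a second document and concatenate the two results.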
 
