all the text (including tags) between <body> .. </body>

tarakparekh · Sep 7, 2005

Hello,

I am not well versed in Perl, so I would like to ask for
suggestions/help.

My goal is to merge 2 HTML files - particularly, to get everything
between <body> C1 </body> in file1.html and <body> C2 </body>
in file2.html to create file3.html that has:
<body> C1 C2 </body>

C1, C2 can contain anything that can fall within the "body" tags.

I have taken a look at HTML::Parser, as well as HTML::TokeParser,
but on initial tryouts was unable to get the tags themselves.

Some postings indicated that regular expressions are not good for HTML
parsing, but what I am doing in terms of merging is pretty dumb.

Would appreciate any help.
thanks,
tarak

Simon Taylor · Sep 7, 2005

Hello Tarak,

> I am not well versed in Perl, so I would like to ask for
> suggestions/help.
>
> My goal is to merge 2 HTML files - particularly, to get everything
> between <body> C1 </body> in file1.html and <body> C2 </body>
> in file2.html to create file3.html that has:
> <body> C1 C2 </body>
>
> C1, C2 can contain anything that can fall within the "body" tags.
>
> I have taken a look at HTML::Parser, as well as HTML::TokeParser,
> but on initial tryouts was unable to get the tags themselves.
>
> Some postings indicated reg. expressions are not good for HTML
> parsing, but what I am doing in terms of merging is pretty dumb.
>
> Would appreciate any help.

This may help throw some light on using HTML::Parser:

http://www.perlmeme.org/tutorials/html_parser.html

And in general, google for "using HTML::Parser".

Regards,

Simon Taylor

Paul Lalli · Sep 7, 2005

> I am not well versed in Perl, so I would like to ask for
> suggestions/help.
>
> My goal is to merge 2 HTML files - particularly, to get everything
> between <body> C1 </body> in file1.html and <body> C2 </body>
> in file2.html to create file3.html that has:
> <body> C1 C2 </body>
>
> C1, C2 can contain anything that can fall within the "body" tags.
>
> I have taken a look at HTML::Parser, as well as HTML::TokeParser,
> but on initial tryouts was unable to get the tags themselves.

So what were those initial tryouts? And what were the results? No one
can help you fix your program if you don't show us your program.

Please read the posting guidelines for this group. Then please post a
short-but-complete script that demonstrates the errors you're having.
Then we can help you correct those errors.

> Some postings indicated reg. expressions are not good for HTML
> parsing,

Correct.

> but what I am doing in terms of merging is pretty dumb.

I have no idea what 'dumb' means in this context.

Paul Lalli

tarakparekh · Sep 7, 2005

Paul,

Sorry for not posting the script earlier. status.html is a small HTML
file containing some links to pictures and some to other HTML documents.

#!/usr/pkg/bin/perl

package my_parser;
use base 'HTML::Parser';

$in_body = 0;
$body = "";

sub start {
    my ($self, $tag) = @_;

    if ($tag eq 'body') {
        $in_body = 1;
    }
}

sub end {
    my ($self, $tag) = @_;

    if ($tag eq 'body') {
        $in_body = 0;
    }
}

sub text {
    my ($self, $text) = @_;

    if ($in_body) {
        $body .= $text;
    }
}

my $p = my_parser->new();
$p->parse_file('status.html');

print "BODY=$body\n";

--- Results:
BODY=Project: P1Status: Owner: owner1 Issues/Comments:
Issue 1
----

As expected, I lost all the link and image tags, but I don't know how
to retain them.

What I meant by "dumb" was that I want nothing but all the text
between the <body> .. </body> tags. No other processing.

thanks,
tarak
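
For readers following the subclass approach above, here is a minimal sketch of one way to keep the tags. It assumes the version 2 style callbacks that HTML::Parser uses for subclasses created without handler arguments, where start() and end() receive the original tag text as their last argument. The class name BodyKeeper, the helper extract_body, and the hash keys in_body and body are made up for this illustration. Comments and declarations inside the body are not collected by this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# BodyKeeper is a hypothetical class name used only for this sketch.
package BodyKeeper;
use base 'HTML::Parser';

# Version 2 style callbacks: the last argument of start() and end()
# is the tag exactly as it appeared in the source document.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag eq 'body') { $self->{in_body} = 1; return; }
    $self->{body} .= $origtext if $self->{in_body};
}

sub end {
    my ($self, $tag, $origtext) = @_;
    if ($tag eq 'body') { $self->{in_body} = 0; return; }
    $self->{body} .= $origtext if $self->{in_body};
}

sub text {
    my ($self, $text) = @_;
    $self->{body} .= $text if $self->{in_body};
}

package main;

# Parse an HTML string and return what sits between the body tags.
sub extract_body {
    my ($html) = @_;
    my $p = BodyKeeper->new();
    $p->{in_body} = 0;    # state lives in the parser object, not in globals
    $p->{body}    = '';
    $p->parse($html);
    $p->eof;
    return $p->{body};
}

print extract_body(
    '<html><body><p>See <a href="x.html">x</a></p></body></html>'
), "\n";
```

Keeping the state in the parser object rather than in package globals lets two files be parsed one after the other without the results running together.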

Scott Bryce · Sep 7, 2005

> Sorry for not posting the script earlier. status.html is a small html
> file containing some links to pictures, and some to other html
> documents.

<code snipped>

Since you only want to know what is between the <body> and the </body>
tags, ask the parser to only report on those tags.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

my $content;

my $p = HTML::Parser->new( api_version => 3,
                           start_h     => [\&start],
                           end_h       => [\&end, 'skipped_text'],
                           report_tags => ['body']
                         );

$p->parse_file('status.html') or die "Cannot parse status.html -- $!";

print $content;

sub start
{
    # Nothing needs to happen here
}

sub end
{
    $content = shift;
}

In Win98SE this is putting an extra CRLF at the end of each line. I
don't know if this is a Windows specific thing, or if I am missing
something in the docs that explains why this is happening.
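
Applied to the original goal of combining two documents, the report_tags and skipped_text technique above could be wrapped in a function and called once per input. This is only a sketch: body_content and merge_bodies are hypothetical helper names, the inputs are inline strings rather than file1.html and file2.html, and it assumes each input has exactly one <body> .. </body> pair.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Return everything between <body> and </body>, tags included.
# body_content is a hypothetical helper name.
sub body_content {
    my ($html) = @_;
    my $content = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Reporting the <body> start tag resets the skipped text,
        # so nothing from before <body> is carried along.
        start_h     => [ sub { }, 'tagname' ],
        # At </body>, skipped_text holds everything that was not
        # reported since the <body> start tag.
        end_h       => [ sub { $content = shift }, 'skipped_text' ],
        report_tags => ['body'],
    );
    $p->parse($html);
    $p->eof;
    return $content;
}

# Join the body contents of several documents into one body element.
sub merge_bodies {
    my @html = @_;
    return "<body>\n"
         . join( "\n", map { body_content($_) } @html )
         . "\n</body>\n";
}

print merge_bodies(
    '<html><body><p>C1 <a href="a.html">a</a></p></body></html>',
    '<html><body><p>C2 <img src="b.png"></p></body></html>',
);
```

To work on real files, the strings would be read from file1.html and file2.html and the result written to file3.html.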

A. Sinan Unur · Sep 8, 2005

(e-mail address removed) wrote in @g43g2000cwa.googlegroups.com:

> Paul,
>
> Sorry for not posting the script earlier. status.html is a small html
> file containing
> some links to pictures, and some to other html documents.
>
> #!/usr/pkg/bin/perl
>
> package my_parser;
> use base 'HTML::Parser';

I see Scott Bryce has already posted a solution to your problem, but
here is another way to do it:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(
    url => 'http://www.yahoo.com/'
);

my $in_body;

while( my $token = $p->get_token ) {
    if( $token->is_start_tag('body') ) {
        $in_body = 1;
        next;
    } elsif( $token->is_end_tag('body') ) {
        $in_body = 0;
        next;
    }
    print $token->as_is if $in_body;
}
__END__
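
The example above reads from a URL. For local input such as status.html, the same token loop can be pointed at a file or a string instead. The sketch below assumes the named source types of HTML::TokeParser::Simple (string => ... here, with file => ... as the counterpart for a path). The function name body_as_is is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Collect the original text of every token between <body> and </body>.
# body_as_is is a hypothetical helper name.
sub body_as_is {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );
    my $in_body = 0;
    my $out     = '';
    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('body') ) { $in_body = 1; next; }
        if ( $token->is_end_tag('body') )   { $in_body = 0; next; }
        $out .= $token->as_is if $in_body;
    }
    return $out;
}

print body_as_is(
    '<html><body><p>Status: <a href="p1.html">P1</a></p></body></html>'
), "\n";
```

Returning the text instead of printing it makes it straightforward to call the function once for each input and concatenate the results.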
