Removing HTML

Bill H

Does anyone have an easy/fast way of removing HTML tags from a text
file using perl? I am using the following brute force way of doing it,
which does work, but can be a little slow:

$body = substr($body,0,$a);

while(index($body,"<") != -1)
{
    $a = index($body,"<");
    $beg = substr($body,0,$a);
    $body = substr($body,$a + 1)." ";
    $a = index($body,">");
    $fin = substr($body,$a + 1);
    $body = $beg.$fin;
}
$body =~ s/&nbsp;/ /gi;
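
Much of the time in the loop above probably goes to rescanning and recopying $body from the start for every tag. Below is a minimal sketch of the same index/substr idea done in one left-to-right pass. The name strip_tags_naive is made up for illustration and is not from the thread, and unlike the original it does not pad the result with spaces. Like the original, it is fooled by a ">" inside an attribute value, by comments, and by script blocks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): drop everything between
# a '<' and the next '>' in a single pass over the string.
sub strip_tags_naive {
    my ($body) = @_;
    my $out = '';
    my $pos = 0;                                   # where the next search starts
    while ((my $lt = index($body, '<', $pos)) != -1) {
        $out .= substr($body, $pos, $lt - $pos);   # keep text before the tag
        my $gt = index($body, '>', $lt);
        if ($gt == -1) {                           # unterminated tag: drop the rest
            $pos = length $body;
            last;
        }
        $pos = $gt + 1;                            # resume just after the tag
    }
    $out .= substr($body, $pos);                   # text after the last tag
    return $out;
}

print strip_tags_naive('<p>Hello <b>world</b></p>'), "\n";   # prints "Hello world"
```

Each character is looked at a bounded number of times, so the cost should grow roughly linearly with the size of the input rather than with size times number of tags.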
 
Martijn Lievaart

Bill H said:
Does anyone have an easy/fast way of removing HTML tags from a text
file using perl? I am using the following brute force way of doing it,
which does work, but can be a little slow:

An alternative to what the other posters said, though possibly even slower,
is to use lynx with the --dump option.

HTH,
M4
 
Joe Smith

Bill said:
Does anyone have an easy/fast way of removing HTML tags from a text
file using perl?

Once upon a time someone complained that HTML::Parser was too hard to use.
I whipped up this little ditty to show it's easy.
-Joe


#!/usr/bin/perl
# Name: nohtml Author: (e-mail address removed) 07-Nov-2001
# Purpose: Extracts just the text portions of a document.

use strict; use warnings;
use HTML::Parser ();

sub text_handler { # Ordinary text
    print @_;
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( text => \&text_handler, "dtext");
$p->parse_file(shift || "-") || die $!;
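
Since the original question holds the document in $body rather than in a file, here is a hedged sketch of the same HTML::Parser approach applied to a string. The strip_html name and the choice to skip script and style elements are assumptions added for illustration, not part of the script above; new, parse, eof, and ignore_elements are HTML::Parser methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return only the text portions of an HTML string.
sub strip_html {
    my ($html) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the handler text with entities such as &amp; decoded
        text_h      => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $p->ignore_elements(qw(script style));   # assumption: their contents are not wanted
    $p->parse($html);
    $p->eof;                                  # flush anything still buffered
    return $text;
}

print strip_html('<p>Hello <b>world</b> &amp; all</p>'), "\n";
```

Collecting into a string instead of printing lets the caller keep working with the result, for example by assigning it back to $body.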
 
