What would the regular expression be to replace the word "body" in an
HTML document as long as it's not between < and >, so that it doesn't
replace the actual body tag or anything else inside a tag that contains
"body"?
Thanks in advance for any help.
-Tim
As others in the thread have suggested, parsing HTML with a regular
expression is not reliable. If you do a lot of HTML parsing,
HTML::TreeBuilder
(http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/lib/HTML/TreeBuilder.pm)
is a good place to start. If you're using Windows, the latest ActiveState
builds include the module in the default install. You need a basic
understanding of Perl objects, but with a little effort it's not too
bad, and I'm definitely not an expert in Perl or programming in general.
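To show why the regex route is fragile, here's a minimal sketch of the usual "skip anything that looks like a tag" trick. The helper name replace_outside_tags and the replacement word 'torso' are made up for illustration, and this is not meant as a recommended solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy tags through unchanged, replace "body" elsewhere.
sub replace_outside_tags {
    my ($html, $replacement) = @_;
    # Alternation: either match a whole <...> run (captured in $1) and put
    # it back untouched, or match the bare word and substitute it.
    $html =~ s{(<[^>]*>)|\bbody\b}{ defined $1 ? $1 : $replacement }gie;
    return $html;
}

# Simple markup: the tags survive, the text is changed.
print replace_outside_tags('<body class="x">The body text</body>', 'torso'), "\n";

# A ">" inside an attribute value ends the "tag" early, so the rest of
# the attribute is treated as page text and gets rewritten.
print replace_outside_tags('<img alt="a > body">', 'torso'), "\n";
```

The second call changes the alt attribute, which is exactly what the original question wanted to avoid. Comments and script blocks can trip up this pattern in similar ways.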
Here's a quick fix:
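#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new();
$html->parse_file('test.html');

# Turn text segments into pseudo-elements (tag '~text') so that
# look_down() can find them.
$html->objectify_text();

# Collect only the text nodes that contain the whole word "body".
my @text_nodes = $html->look_down('_tag', '~text',
    sub { $_[0]->attr('text') =~ /\bbody\b/i }
);

foreach (@text_nodes) {
    # The replacement part is empty here, so "body" is simply removed;
    # put the new text between the last two # characters.
    (my $new_text = $_->attr('text')) =~ s#\bbody\b##ig;
    $_->attr('text', $new_text);
}

# Convert the pseudo-elements back to plain text and print the result.
$html->deobjectify_text();
print $html->as_HTML();
$html->delete();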
Basically, the look_down() method pulls out all text segments that
contain the word, and the foreach loop runs the substitution on each one
and stores the result back with attr(). As already noted, depending on
your *exact* needs, the regular expression used to identify/replace the
string may need to be modified.
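As a rough illustration of how the pattern might be adjusted, here is a small sketch run on a plain string rather than on a parsed document. The sample text and the replacement word 'torso' are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'Body text, somebody, and the body.';

# Whole word only, case-insensitive (same pattern as in the script).
(my $whole = $text) =~ s/\bbody\b/torso/ig;

# Without \b, the "body" inside "somebody" is rewritten too.
(my $any = $text) =~ s/body/torso/ig;

# Without /i, the capitalised "Body" is left alone.
(my $exact = $text) =~ s/\bbody\b/torso/g;

print "$whole\n$any\n$exact\n";
# torso text, somebody, and the torso.
# torso text, sometorso, and the torso.
# Body text, somebody, and the torso.
```

Whichever variant fits, it can be dropped into the s### line inside the foreach loop.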
HTH
ko