Replace a word if it's not in an HTML tag

Tim

What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Thanks in advance for any help.

-Tim
 
Chesucat

Tim said:
What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Thanks in advance for any help.

-Tim

s/body//g

--
(e-mail address removed)
SDF Public Access UNIX System - http://sdf.lonestar.org
Carelessly planned projects take three times longer to complete than
expected. Carefully planned projects take four times longer to
complete than expected, mostly because the planners expect their
planning to reduce the time it takes.
 
Tad McClellan

Tim said:
What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and >


What you seek cannot be done reliably with a pattern match.

You should use a module that understands HTML for processing HTML.
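
One way to follow that advice is a minimal sketch with HTML::TokeParser (part of the HTML-Parser distribution on CPAN), which hands back the document one token at a time so that only text tokens need to be touched. The helper name replace_in_text, the sample input, and the replacement word "torso" are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: replace the whole word $old with $new, but only
# in text tokens. Tags, comments and declarations are copied unchanged.
sub replace_in_text {
    my ($html, $old, $new) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    my $out = '';
    while (my $token = $parser->get_token) {
        if ($token->[0] eq 'T') {
            my ($text, $is_data) = @{$token}[1, 2];
            # $is_data is true for script/style content; leave that alone.
            $text =~ s/\b\Q$old\E\b/$new/g unless $is_data;
            $out .= $text;
        }
        else {
            # For all other token types the last element is the source text.
            $out .= $token->[-1];
        }
    }
    return $out;
}

my $sample = '<body class="body"><!-- body --><p>The body is here</p></body>';
print replace_in_text($sample, 'body', 'torso'), "\n";
# prints: <body class="body"><!-- body --><p>The torso is here</p></body>
```

Because the tokenizer reports start tags, end tags and comments as their own token types, the attribute value and the comment are left alone without any extra pattern tricks.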
 
Tad McClellan

Silly attempt #1.

Changes "<body>" into "<>".

It appears that you did not even read the problem specification
before offering your "answer".

Try:
s/([^<])body([^>])/${1}new_body$2/g; # not beginning or end of line


Silly attempt #2.

Changes <img src="nobody"> into <img src="nonew_body">


[snip further silly attempts that also fail]

There is no such thing as "lines" in HTML. If you think in terms
of lines when processing HTML then you are not thinking correctly.
 
Matt Garrish

Tad McClellan said:
Try:
s/([^<])body([^>])/${1}new_body$2/g; # not beginning or end of line


Silly attempt #2.

Changes <img src="nobody"> into <img src="nonew_body">

Sadly, nothing is 100% when it comes to unstructured HTML, not even the
oft-cited parser modules. Makes you wonder why they bother with the DTDs
when Microsoft and Netscape have done everything they can to accommodate bad
tagging.

Matt

It's free, use it: http://validator.w3.org/
 
Tad McClellan

JS Bangs said:
Chesucat sikyal:

Better would be


to give up on trying to do it with a pattern match.

s/([^<|<\/])\bbody\b/$1/g
^^^^^^^^

This does not do what you think it does, it is exactly equivalent to:


s/([^\/|<])\bbody\b/$1/g
and
s/([^<\/|])\bbody\b/$1/g
and
s/([^|\/<])\bbody\b/$1/g

the order of characters in a character class does not matter,
neither does repeating a character.
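
A small sketch to check that claim: it runs the four spellings of the character class against a few sample strings and compares the results. The sample strings are chosen for illustration and are not from the thread.

```perl
use strict;
use warnings;

# The four spellings of the character class quoted above.
my @classes = ('[^<|<\/]', '[^\/|<]', '[^<\/|]', '[^|\/<]');

# Sample inputs, made up for this illustration.
my @samples = ('<body>', '</body>', '|body', ' body', '<|body');

my $all_agree = 1;
for my $sample (@samples) {
    # 1 if the pattern built from this class matches the sample, else 0.
    my @hits = map { $sample =~ /($_)\bbody\b/ ? 1 : 0 } @classes;
    $all_agree = 0 if grep { $_ != $hits[0] } @hits;
    printf "%-10s %s\n", $sample, join(' ', @hits);
}
print $all_agree ? "all four classes agree\n" : "classes differ\n";
```

Each class excludes the same three characters (<, | and /), so every row shows the same result in all four columns.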

which appears to do what you want.


It does not appear that way to me.

<img src="body"> becomes <img src=""> with your code

We were supposed to leave it alone when it was inside of a tag.

<p>See body.doc</p> becomes <p>See .doc</p>

It is not clear what we were supposed to do in that case. Poor spec.

To the OP: Better yet would be to read
http://www.perldoc.com/perl5.6/pod/perlre.html so you can understand what
this means.


JS Bangs should re-read too, it would appear. :)

Along with this Perl FAQ:

How do I remove HTML from a string?

which gives some truly tricky cases rather than just the somewhat
obvious ones that trip up all of the code in this thread so far.
 
Jürgen Exner

Tim said:
What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Question: in
<tag>body</tag>
is the text "body" between < and > or not?
I would say yes, because it is between the first and the last character of
that line and those characters happen to be < and >. So you may want to
rethink your question.

Having said that, please note that contrary to popular belief, correct
parsing of HTML is rocket science, and while it may be possible to do it with
REs, any sane person would not attempt it and would use an HTML parser to
parse HTML instead.

jue
 
ko

Tim said:
What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Thanks in advance for any help.

-Tim

As others in the thread have suggested, parsing HTML with a regular
expression is not reliable. If you do a lot of HTML parsing,
HTML::TreeBuilder (http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/lib/HTML/TreeBuilder.pm)
is a good start. If you're using Windows, the latest ActiveState
builds include the module with the default install. You need a basic
understanding of Perl objects, but with a little effort it's not too
bad, and I'm definitely not an expert in Perl or programming in general.
Here's a quick fix:

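#!/usr/bin/perl -w
use strict;

use HTML::TreeBuilder;

# Parse the file into a tree
my $html = HTML::TreeBuilder->new();
$html->parse_file('test.html');

# Turn text segments into pseudo-elements tagged '~text'
# so that look_down() can find them
$html->objectify_text();

# Collect only the text nodes that contain the word "body"
my @text_nodes = $html->look_down('_tag', '~text',
    sub { $_[0]->attr('text') =~ /\bbody\b/i }
);

# Rewrite each matching text node (here the word is simply removed)
foreach (@text_nodes) {
    (my $new_text = $_->attr('text')) =~ s#\bbody\b##ig;
    $_->attr('text', $new_text);
}

# Turn the pseudo-elements back into plain text and print the result
$html->deobjectify_text();
print $html->as_HTML();
$html->delete();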

Basically, the look_down() method pulls out all text segments that
contain the word, and the attr() calls in the foreach loop do the
replacement (here the word is simply deleted). As already noted,
depending on your *exact* needs, the regular expression used to
identify/replace the string may need to be modified.

HTH
ko
 
Tad McClellan

Jürgen Exner said:
Question: in
<tag>body</tag>
is the text "body" between < and > or not?


Even more troubling: what about these?

<!-- body -->

<!-- <body> NOT a body tag! -->
 
JS Bangs

Tad McClellan sikyal:
JS Bangs said:
Chesucat sikyal:
Better would be


to give up on trying to do it with a pattern match.

Agreed.

s/([^<|<\/])\bbody\b/$1/g
^^^^^^^^

This does not do what you think it does, it is exactly equivalent to:


s/([^\/|<])\bbody\b/$1/g
and
s/([^<\/|])\bbody\b/$1/g
and
s/([^|\/<])\bbody\b/$1/g

the order of characters in a character class does not matter,
neither does repeating a character.

You're correct. How *does* one assert a negative group of multiple
characters?

It does not appear that way to me.

<img src="body"> becomes <img src=""> with your code

We were supposed to leave it alone when it was inside of a tag.

<p>See body.doc</p> becomes <p>See .doc</p>

It is not clear what we were supposed to do in that case. Poor spec.

Indeed. When the OP said "Don't replace 'body' inside a tag", I DWIMmed
that to mean "Don't replace the <body></body> elements". If he truly does
not want to replace 'body' inside a tag, then a regexp is definitely out
of the question.

JS Bangs should re-read too, it would appear. :)

Along with this Perl FAQ:

How do I remove HTML from a string?

That certainly does trip me up :).

--
Jesse S. Bangs (e-mail address removed)
http://students.washington.edu/jaspax/
http://students.washington.edu/jaspax/blog

Jesus asked them, "Who do you say that I am?"

And they answered, "You are the eschatological manifestation of the ground
of our being, the kerygma in which we find the ultimate meaning of our
interpersonal relationship."

And Jesus said, "What?"
 
Janek Schleicher

Jeff 'japhy' Pinyan wrote at Thu, 07 Aug 2003 17:21:17 -0400:
Last time I checked, variable-width patterns weren't allowed in a
look-behind.

At least in this case, there's an easy workaround: expand the
variable-length negative look-behind into its fixed-length possibilities:

m!(?<! / body)
(?<! body)!x;
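
A minimal sketch of that workaround, assuming the goal is to skip "body" when it directly follows "<" or "</". The sample input and the replacement word "torso" are made up for illustration.

```perl
use strict;
use warnings;

# Sample input, made up for this illustration.
my $html = '<body><p>The body of the page</p></body>';

# Two fixed-width negative look-behinds stand in for one
# variable-width look-behind: one rejects "</", the other "<".
(my $fixed = $html) =~ s{(?<!</)(?<!<)\bbody\b}{torso}g;

print "$fixed\n";    # <body><p>The torso of the page</p></body>
```

This only protects the tag name itself. An input such as <img src="body"> is still rewritten, so it does not answer the objections raised earlier in the thread.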


Greetings,
Janek
 
