Replace a word if it's not in an HTML tag

Tim · Aug 6, 2003

What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Thanks in advance for any help.

-Tim

Chesucat · Aug 6, 2003

What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Thanks in advance for any help.

-Tim

s/body//g

--
(e-mail address removed)
SDF Public Access UNIX System - http://sdf.lonestar.org
Carelessly planned projects take three times longer to complete than
expected. Carefully planned projects take four times longer to
complete than expected, mostly because the planners expect their
planning to reduce the time it takes.

Tad McClellan · Aug 6, 2003

Tim said:
What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and >

What you seek cannot be done reliably with a pattern match.

You should use a module that understands HTML for processing HTML.

Tad McClellan · Aug 6, 2003

Silly attempt #1.

Changes "<body>" into "<>".

It appears that you did not even read the problem specification
before offering your "answer".

Try:
s/([^<])body([^>])/${1}new_body$2/g; # not beginning or end of line

Silly attempt #2.

Changes <img src="nobody"> into <img src="nonew_body">

[snip further silly attempts that also fail]

There is no such thing as "lines" in HTML. If you think in terms
of lines when processing HTML then you are not thinking correctly.
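The two failures quoted above can be reproduced with a short script. This is only a sketch: the helper names attempt_one and attempt_two are made up for illustration and are not from any post in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Attempt #1: delete every "body", including the one in the tag name.
sub attempt_one {
    my ($html) = @_;
    $html =~ s/body//g;
    return $html;
}

# Attempt #2: require a non-"<" before and a non-">" after "body".
sub attempt_two {
    my ($html) = @_;
    $html =~ s/([^<])body([^>])/${1}new_body$2/g;
    return $html;
}

print attempt_one('<body>'), "\n";              # prints <>
print attempt_two('<img src="nobody">'), "\n";  # prints <img src="nonew_body">
```

Both patterns only look at the characters right next to the word, so neither can tell whether the match sits inside a tag.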

Matt Garrish · Aug 7, 2003

Tad McClellan said:
Try:
s/([^<])body([^>])/${1}new_body$2/g; # not beginning or end of line

Silly attempt #2.

Changes <img src="nobody"> into <img src="nonew_body">

Sadly, nothing is 100% when it comes to unstructured HTML, not even the
oft-cited parser modules. Makes you wonder why they bother with the DTDs
when Microsoft and Netscape have done everything they can to accommodate bad
tagging.

Matt

It's free, use it: http://validator.w3.org/

Tad McClellan · Aug 7, 2003

JS Bangs said:
Chesucat sikyal:

Better would be

to give up on trying to do it with a pattern match.

s/([^<|<\/])\bbody\b/$1/g

^^^^^^^^

This does not do what you think it does, it is exactly equivalent to:

s/([^\/|<])\bbody\b/$1/g
and
s/([^<\/|])\bbody\b/$1/g
and
s/([^|\/<])\bbody\b/$1/g

the order of characters in a character class does not matter,
neither does repeating a character.

which appears to do what you want.

It does not appear that way to me.

<img src="body"> becomes <img src=""> with your code

We were supposed to leave it alone when it was inside of a tag.

See body.doc becomes See .doc

It is not clear what we were supposed to do in that case. Poor spec.

To the OP: Better yet would be to read
http://www.perldoc.com/perl5.6/pod/perlre.html so you can understand what
this means.

JS Bangs should re-read too, it would appear.

Along with this Perl FAQ:

How do I remove HTML from a string?

which gives some truly tricky cases rather than just the somewhat
obvious ones that trip up all of the code in this thread so far.
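The points about the character class can be checked directly. The following is a rough sketch, assuming only core Perl; the helper name strip_body and the sample strings are chosen for illustration (the samples are the ones discussed above).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The four spellings of the character class compared in the post above.
my @classes = ('[^<|<\/]', '[^\/|<]', '[^<\/|]', '[^|\/<]');

# Apply the substitution built from one class to a copy of the input.
sub strip_body {
    my ($class, $text) = @_;
    $text =~ s/($class)\bbody\b/$1/g;
    return $text;
}

my @samples = ('<body>', '</body>', '<img src="body">', 'See body.doc');

for my $sample (@samples) {
    # All four results are identical for every sample.
    my @results = map { strip_body($_, $sample) } @classes;
    print "$sample => ", join(' | ', @results), "\n";
}
```

The tag names in <body> and </body> survive, but the attribute value in <img src="body"> is emptied and "See body.doc" becomes "See .doc", whichever spelling of the class is used.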

Jürgen Exner · Aug 7, 2003

Tim said:
What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Question: in
<tag>body</tag>
is the text "body" between < and > or not?
I would say yes, because it is between the first and the last character of
that line and those characters happen to be < and >. So you may want to
rethink your question.

Having said that, please note that, contrary to popular belief, correct
parsing of HTML is rocket science. While it may be possible to do it with
REs, any sane person would not attempt it and would instead use an HTML
parser to parse HTML.

jue

ko · Aug 7, 2003

What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

Thanks in advance for any help.

-Tim

As others in the thread have suggested, parsing HTML with a regular
expression is not reliable. If you do a lot of HTML parsing,
HTML::TreeBuilder (http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/lib/HTML/TreeBuilder.pm)
is a good start. If you're using Windows, the latest ActiveState
builds include the module with the default install. You need a basic
understanding of Perl objects, but with a little effort, it's not too
bad - I'm definitely not an expert in Perl or programming in general.
Here's a quick fix:

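#!/usr/bin/perl -w
use strict;

use HTML::TreeBuilder;

# Parse the document into a tree
my $html = HTML::TreeBuilder->new();
$html->parse_file('test.html');

# Turn text segments into '~text' pseudo-elements so look_down() can find them
$html->objectify_text();
my @text_nodes = $html->look_down('_tag', '~text',
    sub { $_[0]->attr('text') =~ /\bbody\b/i }
);

# Remove the word from the text nodes only; tag names and attributes are not touched
foreach (@text_nodes) {
    (my $new_text = $_->attr('text')) =~ s#\bbody\b##ig;
    $_->attr('text', $new_text);
}

# Restore plain text nodes and print the result
$html->deobjectify_text();
print $html->as_HTML();
$html->delete();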

Basically, the look_down() method pulls out all text segments, and the
attr() method in the foreach loop does the replacement. As already
noted, depending on your *exact* needs, the regular expression used to
identify/replace the string may need to be modified.

HTH
ko

Tad McClellan · Aug 7, 2003

Jürgen Exner said:
Question: in
<tag>body</tag>
is the text "body" between < and > or not?

Even more troubling, are these ones?

JS Bangs · Aug 7, 2003

Tad McClellan sikyal:

JS Bangs said:

Chesucat sikyal:

Better would be

to give up on trying to do it with a pattern match.
Agreed.

s/([^<|<\/])\bbody\b/$1/g

^^^^^^^^

This does not do what you think it does, it is exactly equivalent to:

s/([^\/|<])\bbody\b/$1/g
and
s/([^<\/|])\bbody\b/$1/g
and
s/([^|\/<])\bbody\b/$1/g

the order of characters in a character class does not matter,
neither does repeating a character.

You're correct. How *does* one assert a negative group of multiple
characters?

It does not appear that way to me.

<img src="body"> becomes <img src=""> with your code

We were supposed to leave it alone when it was inside of a tag.

See body.doc becomes See .doc

It is not clear what we were supposed to do in that case. Poor spec.

Indeed. When the OP said "Don't replace 'body' inside a tag", I DWIMmed
that to mean "Don't replace the <body></body> elements". If he truly does
not want to replace 'body' inside a tag, then a regexp is definitely out
of the question.

JS Bangs should re-read too, it would appear.

Along with this Perl FAQ:

How do I remove HTML from a string?

That certainly does trip me up.

--
Jesse S. Bangs (e-mail address removed)
http://students.washington.edu/jaspax/
http://students.washington.edu/jaspax/blog

Jesus asked them, "Who do you say that I am?"

And they answered, "You are the eschatological manifestation of the ground
of our being, the kerygma in which we find the ultimate meaning of our
interpersonal relationship."

And Jesus said, "What?"

Tad McClellan · Aug 7, 2003

JS Bangs said:
Tad McClellan sikyal:

s/([^\/|<])\bbody\b/$1/g

How *does* one assert a negative group of multiple
characters?

With a "negative look-behind assertion" (perlre.pod):

/(?<! <\/? body )/x

Jeff 'japhy' Pinyan · Aug 7, 2003

JS Bangs said:

Tad McClellan sikyal:

s/([^\/|<])\bbody\b/$1/g

How *does* one assert a negative group of multiple
characters?

With a "negative look-behind assertion" (perlre.pod):

/(?<! <\/? body )/x

Last time I checked, variable-width patterns weren't allowed in a
look-behind.

Janek Schleicher · Aug 8, 2003

Jeff 'japhy' Pinyan wrote at Thu, 07 Aug 2003 17:21:17 -0400:

Last time I checked, variable-width patterns weren't allowed in a
look-behind.

At least in this case, there's an easy workaround, expanding the variable
length negative look behind into its fixed length possibilities:

m{(?<! </ body)
  (?<! <  body)}x;
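
As a rough illustration of fixed-width look-behinds applied to the original question, the sketch below replaces "body" unless it directly follows "<" or "</". The function name replace_body and the sample strings are assumptions made for this example, not code from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace "body" unless it directly follows "<" or "</".
# Each look-behind has a fixed width (1 and 2 characters respectively).
sub replace_body {
    my ($html, $replacement) = @_;
    $html =~ s{(?<! < ) (?<! </ ) \b body \b}{$replacement}gx;
    return $html;
}

for my $sample ('<body>', '</body>', 'the body text', '<img src="body">') {
    print "$sample => ", replace_body($sample, 'new_body'), "\n";
}
```

This leaves <body> and </body> alone and changes "the body text", but it still rewrites the attribute value in <img src="body">, so it does not solve the "not between < and >" requirement.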

Greetings,
Janek

Gisle Aas · Aug 12, 2003

What would the regular expression be to replace the word "body" in an
html document as long as it's not in between < and > so it doesn't
replace the actual body tag or anything else with body?

There is an example of this in the HTML-Parser distribution. See:

http://search.cpan.org/src/GAAS/HTML-Parser-3.28/eg/htextsub

If you run this program as 'htextsub s/body/foo/g file.html' it should
do what you want.
