Remove all HTML but keep <p> tags

Rob

I am looking for a perl REGEX statement to remove all the HTML from a
string except for the <p> tags. It would have to leave the <p> (and
the </p>) tags but also longer ones such as <p style=...> etc. I
haven't been able to find anything similar online for this.

Can anyone help with a suitable REGEX for this? I have tried a few
things but had no success.

Any help would be much appreciated.

Rob
 
J. Gleixner

Rob said:
I am looking for a perl REGEX statement to remove all the HTML from a
string except for the <p> tags. It would have to leave the <p> (and
the </p>) tags but also longer ones such as <p style=...> etc. I
haven't been able to find anything similar online for this.

Can anyone help with a suitable REGEX for this? I have tried a few
things but had no success.

Any help would be much appreciated.

Rob

Depending on how complex the 'string' is, you probably want to
avoid a regular expression solution and use a parser, e.g.
HTML::Parser.

Read the documentation and take a look at some of the examples
in the distribution, like hstrip and htext.
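
For illustration, here is a minimal sketch of the parser approach, assuming the HTML::Parser module from CPAN is installed. The helper name strip_all_but_p is made up for this example and is not part of the module. Start and end tags are copied through only when the element is p, text is copied as-is, and comments and declarations are dropped because no handler is registered for them.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns $html with every tag removed
# except <p ...> and </p>.
sub strip_all_but_p {
    my ($html) = @_;
    my $out  = '';
    my $skip = 0;    # greater than 0 while inside <script> or <style>

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: keep the original tag text only for <p ...>.
        start_h => [ sub {
            my ($tag, $text) = @_;
            $skip++ if $tag eq 'script' || $tag eq 'style';
            $out .= $text if $tag eq 'p';
        }, 'tagname, text' ],
        # End tags: keep only </p>.
        end_h => [ sub {
            my ($tag, $text) = @_;
            $skip-- if $skip && ($tag eq 'script' || $tag eq 'style');
            $out .= $text if $tag eq 'p';
        }, 'tagname, text' ],
        # Text: copy through unless it is script or style content.
        text_h => [ sub { $out .= $_[0] unless $skip }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print strip_all_but_p(
    qq{<h1>Title</h1><p style="x">Hi <b>there</b></p><!-- <p>no</p> -->\n}
);
```

Because comments never reach a handler, a <p> that appears inside <!-- ... --> is not mistaken for a real paragraph.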
 
Jürgen Exner

Rob said:
I am looking for a perl REGEX statement to remove all the HTML from a

Please see the FAQ and the many, many archived posts on why HTML and
REGEX are not a viable combination.
string except for the <p> tags. It would have to leave the <p> (and
the </p>) tags but also longer ones such as <p style=...> etc. I
haven't been able to find anything similar online for this.

Can anyone help with a suitable REGEX for this? I have tried a few
things but had no success.

That is not surprising because it cannot be done for arbitrary HTML. For
further details please read up on the Chomsky hierarchy of languages.

jue
 
Peter J. Holzer

[As J. Gleixner has already pointed out, there are HTML parsers
available for perl - doing this with a regexp is almost certainly not
the best way to do this]


Please see the FAQ and the many, many archived posts on why HTML and
REGEX are not a viable combination.

What exactly do you mean by "remove all html except <p> tags"?

What would the result of processing the following (simple) file be?


<html>
<head>
<title>
A test
</title>
</head>
<body>
<h1> A test </h1> <h2> for Robs script </h2>
<p>
The quick brown fox jumps over the lazy dog.
</p>
<table>
<tr>
<td>
<p>
upper left
</p>
<p>
lower left
</p>
</td>
<td>
<p>
right
</p>
</td>
</tr>
</table>
<!--
<p>
This is not a paragraph
</p>
-->
<p>
Over &amp; out!
</p>
</body>

Well, what have you tried?

Some tips:

* Start with a formal grammar of what you want to match.
I usually use some form of BNF.
* Don't try to write the whole Regexp at once. Use one Regexp
for every production in your grammar and use variable substitution
to build more complex regexps (there is a parallel thread about
matching RFC5322 headers with some examples).
* Use /x and comments.
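
The tips above can be sketched roughly as follows. The grammar here is a deliberately simplified assumption (tag names, optional attributes, quoted or bare values) and it knows nothing about comments, CDATA or <script> content, so it is a sketch of the technique rather than a robust HTML stripper. The sub name strip_tags_except_p is made up for the example.

```perl
use strict;
use warnings;

# One small regexp per production of the (simplified, assumed) grammar.
my $name  = qr/ [A-Za-z] [A-Za-z0-9:-]* /x;
my $value = qr/ " [^"]* " | ' [^']* ' | [^\s"'>]+ /x;
my $attr  = qr/ $name (?: \s* = \s* $value )? /x;

# The larger regexp is assembled from the smaller ones.
my $tag = qr{
    < /?                  # start of an opening or closing tag
    ( $name )             # captured: the element name
    (?: \s+ $attr )*      # zero or more attributes
    \s* /? >              # optional self-closing slash, then the end
}x;

# Hypothetical helper: drop every tag whose name is not "p".
sub strip_tags_except_p {
    my ($html) = @_;
    # Group 1 is the whole tag, group 2 the element name inside it.
    $html =~ s{ ( $tag ) }{ lc($2) eq 'p' ? $1 : '' }gex;
    return $html;
}

print strip_tags_except_p('<h1>T</h1><p style="x">Hi <b>there</b></p>'), "\n";
# prints: T<p style="x">Hi there</p>
```

On the sample file above, this sketch would still treat the <p> inside the comment as a real paragraph, which is exactly the kind of case a parser handles for you.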

That is not surprising because it cannot be done for arbitrary HTML. For
further details please read up on the Chomsky hierarchy of languages.

Care to explain how the difference between regular and context-free
grammars is relevant to the task at hand? And you know of course that
Perl regexps are a superset of regular expressions, so that even if the
task is impossible with a regular expression, it may still be possible
with a regexp (has anyone tried to prove that regexps are/are not
equivalent to context-free grammars lately?).
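
As a small illustration of that point, assuming perl 5.10 or later: balanced parentheses are the textbook example of a language that is not regular, yet a Perl regexp with a recursive named group can match them. The group name paren is arbitrary.

```perl
use strict;
use warnings;

# A recursive pattern: a parenthesised group may contain
# non-parenthesis characters or further balanced groups.
my $balanced = qr{
    (?<paren>
        \(                              # literal opening parenthesis
        (?: [^()]++ | (?&paren) )*      # plain text or a nested group
        \)                              # literal closing parenthesis
    )
}x;

for my $s ('(a(b)c)', '(a(b)c', '((()))') {
    printf "%-8s %s\n", $s,
        $s =~ /\A$balanced\z/ ? 'balanced' : 'not balanced';
}
```

This says nothing about whether such a regexp is a good way to process HTML, only that the "regular languages" limit does not apply directly to Perl regexps.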

hp
 
George Mpouras

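# try this
use strict;
use warnings;

# Slurp the whole DATA section into one string.
my $htm = do { local $/; <DATA> };

# /s lets . match newlines, so paragraphs that span several lines are found;
# \b keeps tags such as <pre> from being mistaken for <p>.
while ( $htm =~ /<p\b[^>]*>(.*?)<\/p>/gis ) {
    print "*$1*\n";
}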

__DATA__

<p>Earth</p> blah1 <p style=...>Sun</p> blah1
<p style=...>Moon</p> blah2 <p>
Venus
</p><p style=...>Hermes</p>blah2<p>
Jupiter</p>
 
