[As J. Gleixner has already pointed out, there are HTML parsers
available for Perl; doing this with a regexp is almost certainly not
the best approach.]
Please see the FAQ and the many, many archived posts on why HTML and
regexes are not a viable combination.
What exactly do you mean by "remove all html except <p> tags"?
What would the result of processing the following (simple) file be?
<html>
<head>
<title>
A test
</title>
</head>
<body>
<h1> A test </h1> <h2> for Robs script </h2>
<p>
The quick brown fox jumps over the lazy dog.
</p>
<table>
<tr>
<td>
<p>
upper left
</p>
<p>
lower left
</p>
</td>
<td>
<p>
right
</p>
</td>
</tr>
</table>
<!--
<p>
This is not a paragraph
</p>
-->
<p>
Over & out!
</p>
</body>
Well, what have you tried?
Some tips:
* Start with a formal grammar of what you want to match.
I usually use some form of BNF.
* Don't try to write the whole regexp at once. Use one regexp
for every production in your grammar and use variable interpolation
to build more complex regexps (there is a parallel thread about
matching RFC 5322 headers with some examples).
* Use /x and comments.
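
To make the tips above concrete, here is a minimal sketch of the grammar-first approach. The toy grammar, the variable names and the tag_name helper are all made up for illustration; this handles a tiny subset of start tags and is nowhere near real HTML.

```perl
use strict;
use warnings;

# Hypothetical toy grammar (an assumption for this sketch, not real HTML):
#   name      ::= ALPHA ( ALPHA | DIGIT )*
#   attr      ::= name "=" DQUOTE any-char-except-DQUOTE* DQUOTE
#   start-tag ::= "<" name ( WS attr )* WS? ">"

# One regexp per production, each compiled with qr//x.
my $name = qr{ [A-Za-z] [A-Za-z0-9]* }x;

my $attr = qr{
    $name          # attribute name
    =
    " [^"]* "      # double-quoted value
}x;

my $start_tag = qr{
    <
    ($name)              # capture the element name
    (?: \s+ $attr )*     # zero or more attributes
    \s*
    >
}x;

# Returns the element name if the whole string is a start tag, else undef.
sub tag_name {
    my ($input) = @_;
    return $input =~ /\A $start_tag \z/x ? $1 : undef;
}

for my $input ('<p>', '<td class="left">', '<1bad>') {
    my $found = tag_name($input);
    if (defined $found) {
        print "$input is a start tag for '$found'\n";
    } else {
        print "$input does not match\n";
    }
}
```

Each qr// object keeps its own /x flag when it is interpolated into a larger pattern, so the small pieces can be written and checked one at a time before they are combined.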
That is not surprising because it cannot be done for arbitrary HTML. For
further details please read up on the Chomsky hierarchy of languages.
Care to explain how the difference between regular and context-free
grammars is relevant to the task at hand? And you know of course that
Perl regexps are a superset of regular expressions, so that even if the
task is impossible with a regular expression, it may still be possible
with a regexp (has anyone tried to prove that regexps are/are not
equivalent to context-free grammars lately?).
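
As a small illustration of the "superset" point: balanced parentheses are the textbook example of a language that no regular expression can describe, yet a Perl regexp with a recursive subpattern matches it. This is only a sketch; it assumes Perl 5.10 or later (for (?1) and possessive quantifiers), the name is_balanced is made up, and it says nothing about whether regexps are equivalent to context-free grammars or whether they are a sensible tool for HTML.

```perl
use strict;
use warnings;

# Assumes Perl 5.10+ for recursion (?1) and the possessive quantifier ++.
my $balanced = qr{
    \A
    (                      # group 1: one parenthesised chunk
        \(
        (?:
            [^()]++        # a run of non-parenthesis characters
          |
            (?1)           # or a nested chunk: recurse into group 1
        )*
        \)
    )
    \z
}x;

# Returns 1 if the whole string is one balanced parenthesised chunk, else 0.
sub is_balanced {
    my ($text) = @_;
    return $text =~ $balanced ? 1 : 0;
}

for my $text ('(a(b)c)', '((()))', '(a(b c)') {
    printf "%-8s %s\n", $text, is_balanced($text) ? 'balanced' : 'not balanced';
}
```

The first two strings are reported as balanced and the third is not, because its outer parenthesis is never closed.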
hp