Remove all HTML but keep <p> tags

Rob · Feb 10, 2012

I am looking for a perl REGEX statement to remove all the HTML from a
string except for the <p> tags. It would have to leave the <p> (and
the </p>) tags but also longer ones such as etc. I
haven't been able to find anything similar online for this.

Can anyone help with a suitable REGEX for this? I have tried a few
things but had no success.

Any help would be much appreciated.

Rob

J. Gleixner · Feb 10, 2012

Depending on how complex the 'string' is, you probably want to
avoid a regular expression solution and use a parser, e.g.
HTML::Parser.

Read the documentation and take a look at some of the examples
in the distribution, like hstrip and htext.
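
As a rough illustration of the parser route (not code from this thread): a minimal sketch using the version 3 handler API of HTML::Parser that copies text and <p>/</p> tags through unchanged and drops every other tag. The function name keep_p_tags and the sample input are made up for the example, and skipping the contents of script and style elements is an assumption about what is wanted.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns $html with every tag removed except <p ...> and </p>.
sub keep_p_tags {
    my ($html) = @_;
    my $out = '';

    # Start and end tags are copied verbatim only when the element is "p".
    my $copy_if_p = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $tagname eq 'p';
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $copy_if_p, 'tagname, text' ],
        end_h       => [ $copy_if_p, 'tagname, text' ],
        text_h      => [ sub { $out .= $_[0] }, 'text' ],   # plain text passes through
    );

    # Assumption: script/style bodies are not wanted in the output.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print keep_p_tags('<div><p class="x">Hello <b>world</b></p></div>'), "\n";
```

Comments, declarations and processing instructions have no handler here, so they are dropped as well. Attributes on the <p> tag survive because the original tag text is copied as-is.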

Jürgen Exner · Feb 11, 2012

Rob said:
> I am looking for a perl REGEX statement to remove all the HTML from a

Please see the FAQ and the many, many archived posts why HTML and REGEX
is not a viable combination.

> string except for the <p> tags. It would have to leave the <p> (and
> the </p>) tags but also longer ones such as etc. I
> haven't been able to find anything similar online for this.
>
> Can anyone help with a suitable REGEX for this? I have tried a few
> things but had no success.

That is not surprising because it cannot be done for arbitrary HTML. For
further details please read up on the Chomsky hierarchy of languages.

jue

Peter J. Holzer · Feb 11, 2012

[As J. Gleixner has already pointed out, there are HTML parsers
available for perl - doing this with a regexp is almost certainly not
the best way to do this]

Jürgen Exner said:
> Please see the FAQ and the many, many archived posts why HTML and REGEX
> is not a viable combination.

What exactly do you mean by "remove all html except <p> tags"?

What would the result of processing the following (simple) file be?

<html>
<head>
<title>
A test
</title>
</head>
<body>
<h1> A test </h1> <h2> for Robs script </h2>

<p>
The quick brown fox jumps over the lazy dog.
</p>

<table>
<tr>
<td>
<p>
upper left
</p>
<p>
lower left
</p>
</td>
<td>
<p>
right
</p>
</td>
</tr>
</table>

<p>
Over & out!
</p>

</body>

Well, what have you tried?

Some tips:

* Start with a formal grammar of what you want to match.
I usually use some form of BNF.
* Don't try to write the whole Regexp at once. Use one Regexp
for every production in your grammar and use variable substitution
to build more complex regexps (there is a parallel thread about
matching RFC5322 headers with some examples).
* Use /x and comments.
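
The tips above could look roughly like this in practice: a sketch, not code from this thread, that builds a tag-matching regexp out of one small qr// piece per production and then drops every tag that is not <p ...> or </p>. The grammar is deliberately simplified (no comments, CDATA, or script contents), and all variable names and the function name strip_all_but_p are made up for the example.

```perl
use strict;
use warnings;

# One small regexp per production of a (simplified) tag grammar.
my $name    = qr/ [A-Za-z][A-Za-z0-9:_-]* /x;              # element or attribute name
my $value   = qr/ "[^"]*" | '[^']*' | [^\s"'>]+ /x;        # quoted or bare attribute value
my $attr    = qr/ $name (?: \s* = \s* $value )? /x;        # name or name=value
my $p_open  = qr/ < p (?: \s+ $attr )* \s* > /xi;          # <p> or <p attr="...">
my $p_close = qr/ < \/ p \s* > /xi;                        # </p>
my $any_tag = qr/ < \/? $name (?: \s+ $attr )* \s* \/? > /x;

# Hypothetical helper: remove every tag except opening and closing p tags.
sub strip_all_but_p {
    my ($html) = @_;
    $html =~ s{($any_tag)}{
        my $tag = $1;   # copy first: the inner match below resets $1
        $tag =~ /\A(?:$p_open|$p_close)\z/ ? $tag : '';
    }ge;
    return $html;
}

print strip_all_but_p('<div><p class="a b">Hi <b>there</b></p><pre>x</pre></div>'), "\n";
```

Because the whole tag has to match $p_open or $p_close from start to end, <pre> and similar names that merely begin with "p" are removed like any other tag. Input that does not follow the simplified grammar (for example a tag inside a comment) will still be handled wrongly, which is the reason a parser is usually the safer choice.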

Jürgen Exner said:
> That is not surprising because it cannot be done for arbitrary HTML. For
> further details please read up on the Chomsky hierarchy of languages.

Care to explain how the difference between regular and context-free
grammars is relevant to the task at hand? And you know of course that
Perl regexps are a superset of regular expressions, so that even if the
task is impossible with a regular expression, it may still be possible
with a regexp (has anyone tried to prove that regexps are/are not
equivalent to context-free grammars lately?).
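
To make the "superset" point concrete: balanced, arbitrarily nested parentheses are a textbook example of a language no regular expression can recognise, yet a Perl regexp with a recursive subpattern (available since Perl 5.10) handles it. This is only a sketch of that one feature, with a made-up function name is_balanced; it says nothing about whether regexps cover all context-free grammars.

```perl
use strict;
use warnings;

# Matches a string that consists of exactly one balanced, possibly nested,
# pair of parentheses. (?1) recurses into capture group 1.
my $balanced = qr/
    \A
    (                    # group 1: one balanced pair
        \(
        (?: [^()]++      # a run of non-parenthesis characters
          | (?1)         # or a nested balanced pair
        )*
        \)
    )
    \z
/x;

# Hypothetical helper wrapping the match.
sub is_balanced {
    my ($s) = @_;
    return $s =~ $balanced ? 1 : 0;
}

for my $s ('(a(b)c)', '((a)', '(a))') {
    printf "%-8s %s\n", $s, is_balanced($s) ? 'balanced' : 'not balanced';
}
```

The same construct can in principle be pointed at nested elements, but real HTML adds optional end tags, comments and malformed input on top of plain nesting.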

hp

George Mpouras · Feb 14, 2012

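# try this
use strict;
use warnings;

# Slurp the whole DATA section into one string.
my $htm = do { local $/; <DATA> };

# Print the content of each <p>...</p> pair. \b stops <pre> etc. from
# matching as <p>; /s lets a paragraph span several lines.
while ( $htm =~ /<p\b[^>]*>(.*?)<\/p>/gis ) {
    print "*$1*\n";
}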

__DATA__

<p>Earth</p> blah1 <p>Sun</p> blah1
<p>Moon</p> blah2
<p>Venus</p>
<p>Hermes</p>blah2
<p>Jupiter</p>
