Marc said:
How can I allow some HTML tags like <BR>, <B>, <P> but
filter out <, >, when they stand alone?
Typically, people don't like you using regexes for this kind of task,
because the pattern would have to be *really* complex before it works
satisfactorily. Instead, use something involving an HTML parser module.
I like HTML::TokeParser::Simple for that kind of task.
<http://search.cpan.org/search?module=HTML::TokeParser::Simple>
You loop through the input, processing one token (tag, comment, piece
of text) at a time, act differently depending on the type of token and
its actual contents, and can use $token->as_is to just pass it through
unchanged (the ordinary case). You can filter out disallowed tags and
disallowed attributes. You could probably even use it to balance the
leftover allowed tags.
Here's a demo script (make sure the line containing just "*END*" has no
whitespace in front of it, or the here-document will not terminate):
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = <<"*END*";
<P>Get up in the morning, slaving for bread, sir,
<BR>so that every mouth can be fed.
<P><B>Poor me</B>, the Israelite. <I>Aah.</I>
<!-- this is a comment. It'll be gone. -->
<P>There's a lone "<" in here, matched by a lone ">".
<script language="Javascript">alert("Hello, World!")</script>
<P>I don't like <a href="http://example.com">links</a> either,
but will allow for <a name="foo"></a>anchors.
*END*

my $p = HTML::TokeParser::Simple->new(\$html);
# Tags that are passed through unchanged.
my %allow = map { $_ => 1 } qw(b i u br p);
# Tags that are dropped together with everything inside them.
my %wipe_content = map { $_ => 1 } qw(style script);
# Entities for angle brackets that show up in plain text.
my %escape = ( '<' => '&lt;', '>' => '&gt;' );

while (my $t = $p->get_token) {
    if ($t->is_tag) {
        my $tag = $t->get_tag;
        if ($tag eq 'a') {
            # Keep named anchors only and close them right away;
            # links (and every </a>) are dropped.
            print $t->as_is, "</a>" if defined $t->get_attr('name');
        } elsif ($allow{$tag}) {
            print $t->as_is;
        } elsif ($wipe_content{$tag}) {
            # Skip tokens up to and including the matching end tag.
            while (my $t = $p->get_token) {
                last if $t->is_end_tag($tag);
            }
        }
        # Any other tag is silently dropped.
    } elsif ($t->is_comment) {
        # wipe
    } elsif ($t->is_text) {
        my $text = $t->as_is;
        $text =~ s/([<>])/$escape{$1}/g;
        print $text;
    }
}
Result:
<P>Get up in the morning, slaving for bread, sir,
<BR>so that every mouth can be fed.
<P><B>Poor me</B>, the Israelite. <I>Aah.</I>
<P>There's a lone "&lt;" in here, matched by a lone "&gt;".
<P>I don't like links either,
but will allow for <a name="foo"></a>anchors.
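The demo does not touch attributes and does not balance anything. Here
is a rough sketch of how those two ideas might look with the same
module. Treat it as an illustration, not a finished filter: the names
filter_html, rebuild_start_tag, escape_attr, %allow_attr, %balanced and
%passthru are made up for this example, and allowing "align" on <P> is
just an arbitrary policy. It rebuilds start tags from get_attr and
get_attrseq, so the output tags come out in lower case.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical policy tables, not part of the original script.
my %allow_attr = ( p => { align => 1 } );        # attributes kept, per tag
my %balanced   = map { $_ => 1 } qw(b i u);      # tags that get closed for you
my %passthru   = map { $_ => 1 } qw(br p);       # allowed, never balanced
my %wipe       = map { $_ => 1 } qw(style script);

# Escape characters that are unsafe inside a double-quoted attribute value.
sub escape_attr {
    my ($value) = @_;
    my %entity = ( '&' => '&amp;', '"' => '&quot;', '<' => '&lt;', '>' => '&gt;' );
    $value =~ s/([&"<>])/$entity{$1}/g;
    return $value;
}

# Rebuild a start tag, keeping only the whitelisted attributes.
sub rebuild_start_tag {
    my ($t)  = @_;
    my $tag  = $t->get_tag;
    my $attr = $t->get_attr;                     # hashref of all attributes
    my $out  = "<$tag";
    for my $name ( @{ $t->get_attrseq } ) {      # names in source order
        next unless $allow_attr{$tag} && $allow_attr{$tag}{$name};
        $out .= sprintf ' %s="%s"', $name, escape_attr( $attr->{$name} );
    }
    return "$out>";
}

sub filter_html {
    my ($html) = @_;
    my $p    = HTML::TokeParser::Simple->new( \$html );
    my $out  = '';
    my @open;                                    # stack of open balanced tags
    while ( my $t = $p->get_token ) {
        if ( $t->is_start_tag ) {
            my $tag = $t->get_tag;
            if ( $wipe{$tag} ) {
                # Discard everything up to the matching end tag.
                while ( my $skip = $p->get_token ) {
                    last if $skip->is_end_tag($tag);
                }
                next;
            }
            next unless $balanced{$tag} || $passthru{$tag};
            push @open, $tag if $balanced{$tag};
            $out .= rebuild_start_tag($t);
        }
        elsif ( $t->is_end_tag ) {
            my $tag = $t->get_tag;
            $tag =~ s{^/}{};                     # defensive: strip a leading slash
            # End tags with no open partner (including </p>) are dropped.
            next unless grep { $_ eq $tag } @open;
            # Close inner tags first, then the requested one.
            while ( defined( my $top = pop @open ) ) {
                $out .= "</$top>";
                last if $top eq $tag;
            }
        }
        elsif ( $t->is_text ) {
            my $text = $t->as_is;
            $text =~ s/([<>])/$1 eq '<' ? '&lt;' : '&gt;'/ge;
            $out .= $text;
        }
        # Comments, declarations and the like are dropped.
    }
    $out .= "</$_>" for reverse @open;           # close whatever is still open
    return $out;
}

print filter_html('<b onclick="x()">bold <i>both</b> tail</i>'), "\n";
```

In the sample call, the onclick attribute is dropped, the <i> is closed
before the </b>, and the stray </i> at the end is thrown away. A real
filter would also need a policy for nesting (for instance, whether a new
<P> should close open inline tags), which this sketch ignores.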