Simple Regular expression problem

Krebul

Hi,

I'm trying to write a regular expression to escape the < character
within an XML file. I only want to escape the character when it appears
in the value of a node. Ex:

"<xmlnode>one < two</xmlnode>"

I want to convert that to:

"<xmlnode>one &lt; two</xmlnode>"

The logic I decided to use was to simply escape any < character that is
followed by another < character, without a > character in between.

I cannot get my regexp to work, but the closest I've come was:

$str =~ s/<.*?(?!>)</\&lt;$&</g;

Please advise what I'm doing wrong.

Thanks
-Krebul
 
Tad McClellan

Krebul said:
The logic I decided to use was to simply escape any < character that is
followed by another < character, without a > character in between.


$str =~ s/<([^>]*<)/&lt;$1/g;
 
Keith Keller

Krebul said:
I'm trying to write a regular expression to escape the < character
within an XML file. I only want to escape the character when it appears
in the value of a node. Ex:

"<xmlnode>one < two</xmlnode>"

I want to convert that to:

"<xmlnode>one &lt; two</xmlnode>"

This seems to scream out to me to not use a regex, but one of
the XML parser modules. XML::Simple is probably a good place
to start.

--keith
 
Krebul

Thank you. That was very helpful!


Christian said:
Krebul said:
I'm trying to write a regular expression to escape the < character
within an XML file. I only want to escape the character when it appears
in the value of a node. Ex:

"<xmlnode>one < two</xmlnode>"

I want to convert that to:

"<xmlnode>one &lt; two</xmlnode>"

The logic I decided to use was to simply escape any < character that is
followed by another < character, without a > character in between.

I cannot get my regexp to work, but the closest I've come was:

$str =~ s/<.*?(?!>)</\&lt;$&</g;

Please advise what I'm doing wrong.

For one, you're misunderstanding the meaning of zero-width
assertions; for another, your pattern doesn't implement the
logic you describe. The way you've written the substitution,
it puts the &lt; in front of the tag's opening bracket instead
of replacing the in-between one, and it doubles up the opening
bracket in between, because $& already contains that bracket.

Your look-ahead assertion has no effect: the position where
the non-greedy .*? stops is immediately followed by an opening
bracket, so it can never be followed by a closing bracket.
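To make that point concrete, here is a small sketch (the variable names $attempt and $without are just for illustration) that runs the original substitution next to a copy with the look-ahead deleted. If the reasoning above holds, both should give the same result on the sample string:

```perl
use strict;
use warnings;

my $str = '<xmlnode>one < two</xmlnode>';

# The original attempt, unchanged. Wherever the final "<" can match,
# the next character is "<", so (?!>) is always true at that position.
(my $attempt = $str) =~ s/<.*?(?!>)</\&lt;$&</g;

# The same substitution with the look-ahead removed.
(my $without = $str) =~ s/<.*?</\&lt;$&</g;

print "$attempt\n";    # &lt;<xmlnode>one << two</xmlnode>
print "$without\n";    # &lt;<xmlnode>one << two</xmlnode>
```

The match runs from the tag's opening bracket up to and including the in-between bracket, so $& already ends in "<" and the replacement appends a second one.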

What you want to do is to escape any opening angle bracket
that is followed by an arbitrary number of characters that
are _not_ closing angle brackets which are followed by another
opening angle bracket.

To put that description into a pattern:
$str =~ s/
    <            # opening angle bracket
    (?=
        [^>]*    # any run of characters other than a closing angle bracket
        <        # another opening angle bracket
    )
/&lt;/xg;

Note that here I have used the zero-width look-ahead (?=)
so that the subpattern is not part of the match, and I don't
have to care about any $1..$n or $& in the replacement.

The /x modifier lets me put in whitespace and comments in
the pattern, which makes it much more legible.
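As a rough check, the sketch below wraps the look-ahead pattern and the capturing pattern posted earlier in the thread in two helper subs (escape_lookahead and escape_capture are made-up names, and the second test string is my own example, not from the original post). Both agree on the original sample, but they differ when a node value contains more than one stray bracket, because the capturing version consumes the next "<" and /g resumes searching after it:

```perl
use strict;
use warnings;

# Look-ahead version: only the bracket itself is matched and replaced.
sub escape_lookahead {
    my ($s) = @_;
    $s =~ s/<(?=[^>]*<)/&lt;/g;
    return $s;
}

# Capturing version: the text up to and including the next "<" is
# consumed and put back via $1.
sub escape_capture {
    my ($s) = @_;
    $s =~ s/<([^>]*<)/&lt;$1/g;
    return $s;
}

my $single = '<xmlnode>one < two</xmlnode>';
my $double = '<n>a < b < c</n>';

print escape_lookahead($single), "\n";   # <xmlnode>one &lt; two</xmlnode>
print escape_capture($single),   "\n";   # <xmlnode>one &lt; two</xmlnode>
print escape_lookahead($double), "\n";   # <n>a &lt; b &lt; c</n>
print escape_capture($double),   "\n";   # <n>a &lt; b < c</n>
```

Neither version is a general repair for broken XML; both only implement the "< followed by < with no > in between" rule from the original question.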

HTH
-Chris
 
J. Gleixner

Keith said:
This seems to scream out to me to not use a regex, but one of
the XML parser modules. XML::Simple is probably a good place
to start.

Except that since it's not valid XML, it can't be parsed.

% perl -MXML::Simple -e'XMLin( "<xmlnode>one < two</xmlnode>" )'
Invalid element name [Ln: 1, Col: 14]

If possible, fix whatever is generating the invalid XML.

Also, '>' and '&' need to be encoded too, not just '<'.
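If the generator can be fixed, a minimal sketch of escaping a node's text value before it is written out might look like this. The sub name escape_xml_text is hypothetical, and this assumes the value is plain text that has not been escaped already. The ampersand has to be handled first, otherwise the "&" in the freshly inserted &lt; and &gt; would be escaped again:

```perl
use strict;
use warnings;

# Escape the characters mentioned above in a text value.
# Order matters: "&" first, then "<" and ">".
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $value = 'one < two & three > four';
print '<xmlnode>', escape_xml_text($value), "</xmlnode>\n";
# <xmlnode>one &lt; two &amp; three &gt; four</xmlnode>
```

Escaping the value before it is wrapped in tags avoids having to guess afterwards which brackets belong to markup.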
 
