Regular expression lookahead question

zeebster · Jan 13, 2005

I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that begin a protected
entity, such as &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)
However this does not seem to work. Can anyone give me a clue what I am
doing wrong?

xhoster · Jan 14, 2005

zeebster said:
I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that begin a protected
entity, such as &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)

Don't you want to force a ';' after the entities?

However this does not seem to work. Can anyone give me a clue what I am
doing wrong?

Yes, you aren't explaining what you mean when you say it does not seem to
work. You aren't posting working code which demonstrates the problem
you claim to have. You aren't following the posting guidelines.

Xho
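
A minimal sketch of what forcing the ';' could look like, assuming the four entity names from the original pattern are the complete list. The helper name count_bare_amps and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: only these four entity names count as "protected".
# The ';' inside the lookahead means "&ampfoo;" is not mistaken for "&amp;".
my $bare_amp = qr/&(?!(?:quot|lt|gt|amp);)/;

# Hypothetical helper: count the unprotected ampersands in a string.
sub count_bare_amps {
    my ($text) = @_;
    my $count = () = $text =~ /$bare_amp/g;
    return $count;
}

print count_bare_amps('Browning & Associates'), "\n";      # prints 1
print count_bare_amps('Browning &amp; Associates'), "\n";  # prints 0
print count_bare_amps('x &ampfoo; y'), "\n";               # prints 1
```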

Matt Garrish · Jan 14, 2005

Don't you want to force a ';' after the entities?

It's also not a valid way to alternate options, and you still need to check
what is there:

/&(?:(?!(quot|lt|gt|amp)).*);/

Matt

Matt Garrish · Jan 14, 2005

Matt Garrish said:
It's also not a valid way to alternate options, and you still need to
check what is there:

/&(?:(?!(quot|lt|gt|amp)).*);/

Actually, I sent that off without checking thoroughly. The above would
happily consider &ampfoo; a valid entity.

Matt
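
A small sketch of the &ampfoo; problem, assuming the pattern above was meant to read /&(?:(?!(quot|lt|gt|amp)).*);/ and is used to flag invalid entities. The stricter variant and the helper names are illustrative assumptions, not something posted in the thread.

```perl
use strict;
use warnings;

# Assumed reading of the pattern above: flag "&...;" unless a known
# entity name follows the "&" (a prefix check only).
my $loose = qr/&(?:(?!(quot|lt|gt|amp)).*);/;

# Illustrative variant: a name only counts as known when ';' follows it.
my $strict = qr/&(?!(?:quot|lt|gt|amp);)[^\s;]*;/;

# Hypothetical helpers: 1 if the string is flagged as invalid, else 0.
sub flagged_loose  { my ($s) = @_; return $s =~ $loose  ? 1 : 0 }
sub flagged_strict { my ($s) = @_; return $s =~ $strict ? 1 : 0 }

for my $s ('&ampfoo;', '&amp;', '&fudge;') {
    printf "%-10s loose=%d strict=%d\n",
        $s, flagged_loose($s), flagged_strict($s);
}
```

With these inputs the loose pattern lets '&ampfoo;' through because "amp" matches as a prefix, while the variant with ';' inside the lookahead flags it.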

Matt Garrish · Jan 14, 2005

zeebster said:
I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that begin a protected
entity, such as &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)

A negative lookahead is probably not the way to go:

my $str = 'Me & my &fudge. ';

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

while ($str =~ /&([^\s;]+)([\s;])/g) {
    print "No ending semi-colon: $1\n" if $2 =~ tr/ \r\n\t//;
    print "Invalid entity: $1\n" unless $entity{$1};
}

Matt

Matt Garrish · Jan 14, 2005

Matt Garrish said:
zeebster said:

I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that begin a protected
entity, such as &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)

A negative lookahead is probably not the way to go:

my $str = 'Me & my &fudge. ';

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

while ($str =~ /&([^\s;]+)([\s;])/g) {

print "No ending semi-colon: $1\n" if $2 =~ tr/ \r\n\t//;

I've really got to go eat. The above could just be:

print "No ending semi-colon: $1\n" unless $2 eq ';';

Matt
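
Putting the loop and the simplified semicolon test together, a minimal sketch might look like this. The sub name check_entities and the sample string are assumptions for illustration; the entity list is the same four names used above.

```perl
use strict;
use warnings;

# Assumption: the same four entity names as in the loop above.
my %entity = map { $_ => 1 } qw/amp lt gt quot/;

# Hypothetical helper: return a list of problem descriptions for one string.
sub check_entities {
    my ($str) = @_;
    my @problems;
    while ($str =~ /&([^\s;]+)([\s;])/g) {
        my ($name, $end) = ($1, $2);
        push @problems, "No ending semi-colon: $name" unless $end eq ';';
        push @problems, "Invalid entity: $name"       unless $entity{$name};
    }
    return @problems;
}

print "$_\n" for check_entities('Me &amp my &fudge; ');
```

Note that a lone '&' followed directly by whitespace is not reported by this pattern, because [^\s;]+ needs at least one character after the ampersand.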

Peroli · Jan 14, 2005

hey,
This is out of curiosity i am asking... If u r scared about parsing
& charecters as entity in an xml file, y dont u use a CDATA section.
That way this problem will not have arised.

Peroli Sivaprakasam

A. Sinan Unur · Jan 14, 2005

hey,
This is out of curiosity i am asking... If u r scared about parsing
& charecters as entity in an xml file, y dont u use a CDATA section.
That way this problem will not have arised.

Please write in English.

Sinan.

Peroli · Jan 14, 2005

I think the problem of parsing '&' characters arose because
'zeebster' is trying to parse some PCDATA (something like the title tag
in the sample below), where '&' characters are a big threat. But if you use
a CDATA section (the desc tag) in an XML file, any characters are allowed
except ']]>'. I donno if thats english enough for you, but thats as
clear as i can explain.

<root>
<title>XML Parsing Problem</title>
<desc>
<![CDATA[
A negative lookahead is probably not the way to go:

my $str = 'Me & my &fudge. ';

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

while ($str =~ /&([^\s;]+)([\s;])/g) {

print "No ending semi-colon: $1\n" if $2 =~ tr/ \r\n\t//;
print "Invalid entity: $1\n" unless $entity{$1};
]]>
</desc>
</root>

Peroli Sivaprakasam
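
For whoever generates the XML, the CDATA idea could be sketched roughly as below. The helper name cdata_wrap is hypothetical; the only real constraint it relies on is that "]]>" cannot appear inside a CDATA section, so that sequence is split across two adjacent sections.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap arbitrary text in a CDATA section.
# Any "]]>" in the text is split so that it never appears inside
# a single section: "]]" ends one section and ">" starts the next.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('Browning & Associates'), "\n";
print cdata_wrap('a]]>b'), "\n";
```

This only helps when the text is being written out; it does not repair files that already contain bare ampersands outside CDATA.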

Tad McClellan · Jan 14, 2005

I donno if thats english enough for you,

That was fine.

Please don't use "cutsie" spellings (r=are, u=you ...),
it is inconsiderate of your readers.

Arndt Jonasson · Jan 14, 2005

Tad McClellan said:
That was fine.

Please don't use "cutsie" spellings (r=are, u=you ...),
it is inconsiderate of your readers.

One wonders, does that apply to using only lower-case letters too?
Though I do admit to finding those articles easier to read than ones
with the cute spellings.

phaylon · Jan 14, 2005

Arndt said:
One wonders, does that apply to using only lower-case letters too?

Believe me, it's much more annoying in German.

p

Jürgen Exner · Jan 14, 2005

Peroli said:
[...]... If u r scared about parsing
& charecters as entity in an xml file, y dont u use a CDATA section.

Ok, English is not my native language, but this sentence fails my parsing
routines so badly that even the exception handler gives up. I can't even
guess what you might have meant.

jue

zeebster · Jan 14, 2005

I just wanted to know if my regexp syntax was correct.
Here is the situation in more detail. I have a stack of xml files that
were generated by a script that was good enough to protect the special
characters in most of the elements but left out a couple (there is a
filename element, for instance, that contains the full path to a file on
the user's hard drive, and the path contains unprotected ampersands). I am
trying to do a search and replace of all these ampersands, but I do not
want to match the ones that begin protected entities. For
example, it should match Browning & Associates but not Browning &amp;
Associates. Of course the latter is the intended output after the
replace has finished. I ran the aforementioned regexp from the command
line using grep to test it and it returned 0 matches. This is what I
meant by it not working since I know there are dozens of these. I do
not yet have code to display since I was trying to understand the
regexp I should use first.
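
For the search and replace described here, a minimal Perl sketch could look like the following. It assumes the four entity names from the original pattern are the only ones in the files (numeric references such as &#38; or &apos; would be escaped again under that assumption), and the sub name protect_amps is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: escape ampersands that do not start one of the
# four entities named in the thread (assumed to be the complete list).
sub protect_amps {
    my ($text) = @_;
    $text =~ s/&(?!(?:quot|lt|gt|amp);)/&amp;/g;
    return $text;
}

print protect_amps('Browning & Associates'), "\n";      # Browning &amp; Associates
print protect_amps('Browning &amp; Associates'), "\n";  # unchanged
```

If the test was run with a command-line grep in its default POSIX syntax, (?!...) is not a lookahead there, which might explain the zero matches. That is only a guess, since the actual command was not posted.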

xhoster · Jan 14, 2005

zeebster said:
I just wanted to know if my regexp syntax was correct.

Were the syntax not correct, Perl would have notified you of this fact
by reporting a syntax error.

....

I ran the aforementioned regexp from the command
line using grep to test it and it returned 0 matches.

So, you obviously used some code to do this, no? Where is that code?
How do we know that it is the regex that failed to do its job, as
opposed to the part of your code that responds (or is supposed to respond)
to the results returned by the regex?

This is what I
meant by it not working since I know there are dozens of these. I do
not yet have code to display since I was trying to understand the
regexp I should use first.

Obviously you do have code to display. You just choose not to display
it. No skin off my nose.

Xho

xhoster · Jan 14, 2005

Matt Garrish said:
It's also not a valid way to alternate options,

What is not valid about it?

and you still need to
check what is there:

/&(?:(?!(quot|lt|gt|amp)).*);/

Hunh?

Xho
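
On the question of whether the bare alternation is valid: a quick sketch comparing the original lookahead with a version that wraps the alternatives in an extra group. The helper name patterns_agree and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

my $plain   = qr/&(?!quot|lt|gt|amp)/;    # pattern from the original post
my $grouped = qr/&(?!(quot|lt|gt|amp))/;  # same alternation, extra group

# Hypothetical helper: do both patterns agree on whether $text matches?
sub patterns_agree {
    my ($text) = @_;
    my $a = ($text =~ $plain)   ? 1 : 0;
    my $b = ($text =~ $grouped) ? 1 : 0;
    return $a == $b;
}

for my $s ('a & b', '&lt;', '&amp;', '&quot;', '&fudge;', '&ampfoo;') {
    print "${s}: ", (patterns_agree($s) ? 'same result' : 'different'), "\n";
}
```

The alternation already applies to everything inside the lookahead, so the extra parentheses only add a capture group; both forms still treat '&ampfoo;' as protected because neither checks for the ';'.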

Peroli · Jan 15, 2005

hi Arndt Jonasson,
My apologies for writing it hardly readable. I will change my
style of writing. And for zeebster.... im sorry i have no solutions
currently for your problem.

Peroli Sivaprakasam

Peroli · Jan 15, 2005

hi Arndt Jonasson,
My apologies for writing it hardly readable. I will change my
style of writing. And for zeebster.... I am sorry i have no solutions
currently for your problem.

Peroli Sivaprakasam
