Regular expression lookahead question

zeebster

I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that precede a protected
entity such as in the case of &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)
However this does not seem to work. Can anyone give me a clue what I am
doing wrong?
 
xhoster

zeebster said:
I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that precede a protected
entity such as in the case of &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)

Don't you want to force a ';' after the entities?
However this does not seem to work. Can anyone give me a clue what I am
doing wrong?

Yes, you aren't explaining what you mean when you say it does not seem to
work. You aren't posting working code which demonstrates the problem
you claim to have. You aren't following the posting guidelines.

Xho
 
Matt Garrish

Don't you want to force a ';' after the entities?

It's also not a valid way to alternate options, and you still need to check
what is there:

/&(?:(?!(quot|lt|gt|amp)).*);/

Matt
 
Matt Garrish

Matt Garrish said:
It's also not a valid way to alternate options, and you still need to
check what is there:

/&(?:(?!(quot|lt|gt|amp)).*);/

Actually, I sent that off without checking thoroughly. The above would
happily consider &ampfoo; a valid entity.

Matt
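
A minimal sketch of the negative-lookahead approach with the ';' folded into the lookahead, so that something like &ampfoo; is not mistaken for a protected entity. The sub name bare_ampersands and the entity list (the four names from this thread plus apos) are assumptions made for illustration; nobody in the thread posted this code.

```perl
use strict;
use warnings;

# Entity names treated as "already protected" (assumed list, see above).
my $known = qr/(?:amp|lt|gt|quot|apos)/;

# Return the offsets of ampersands that do NOT start "&name;" for a known name.
sub bare_ampersands {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /&(?!$known;)/g) {
        push @offsets, $-[0];    # $-[0] is the start offset of the last match
    }
    return @offsets;
}

print join(',', bare_ampersands('Browning & Associates, A &amp; B, &ampfoo;')), "\n";
```

Because the ';' sits inside the lookahead, &amp; is skipped while the bare & and the & of &ampfoo; are both reported.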
 
Matt Garrish

zeebster said:
I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that precede a protected
entity such as in the case of &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)

A negative lookahead is probably not the way to go:

my $str = 'Me &amp; my &fudge. ';

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

while ($str =~ /&([^\s;]+)([\s;])/g) {

print "No ending semi-colon: $1\n" if $2 =~ tr/ \r\n\t//;
print "Invalid entity: $1\n" unless $entity{$1};

}

Matt
 
Matt Garrish

Matt Garrish said:
zeebster said:
I am trying to parse an xml file for unprotected ampersands (&).
I do not want to match ampersand characters that precede a protected
entity such as in the case of &lt; or &gt;.
After hours of copious searches I found something that looked like it
would work using lookaheads:
&(?!quot|lt|gt|amp)

A negative lookahead is probably not the way to go:

my $str = 'Me &amp; my &fudge. ';

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

while ($str =~ /&([^\s;]+)([\s;])/g) {

print "No ending semi-colon: $1\n" if $2 =~ tr/ \r\n\t//;

I've really got to go eat. The above could just be:

print "No ending semi-colon: $1\n" unless $2 eq ';';


Matt
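
Putting the loop and the follow-up correction together, a self-contained sketch might look like the following. The sub name check_entities and the choice to return messages instead of printing them inside the loop are assumptions made here so the result is easy to inspect; the pattern and the %entity hash are the ones posted above.

```perl
use strict;
use warnings;

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

# Collect a message for every "&name" run that lacks a ';' or names an unknown entity.
sub check_entities {
    my ($str) = @_;
    my @problems;
    while ($str =~ /&([^\s;]+)([\s;])/g) {
        my ($name, $end) = ($1, $2);
        push @problems, "No ending semi-colon: $name" unless $end eq ';';
        push @problems, "Invalid entity: $name"       unless $entity{$name};
    }
    return @problems;
}

print "$_\n" for check_entities('Me &amp; my &fudge. ');
```

One limitation worth noting: the pattern requires at least one non-space character after the '&', so an ampersand followed directly by whitespace (as in "Browning & Associates") is never matched and therefore never reported by this loop.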
 
Peroli

hey,
This is out of curiosity i am asking... If u r scared about parsing
& charecters as entity in an xml file, y dont u use a CDATA section.
That way this problem will not have arised.

Peroli Sivaprakasam
 
A. Sinan Unur

hey,
This is out of curiosity i am asking... If u r scared about parsing
& charecters as entity in an xml file, y dont u use a CDATA section.
That way this problem will not have arised.

Please write in English.

Sinan.
 
Peroli

I think that the problem of parsing '&' chars arose because
'zeebster' is trying to parse some PCDATA (something like the title tag
in the sample below), where '&' characters are a big threat. But if you use
a CDATA section (the desc tag) in an XML file, any characters are allowed
except ']]>'. I don't know if that's English enough for you, but that's as
clear as I can explain.

<root>
<title>XML Parsing Problem</title>
<desc>
<![CDATA[
A negative lookahead is probably not the way to go:

my $str = 'Me &amp; my &fudge. ';

my %entity = map { $_ => 1 } qw/amp lt gt quot/;

while ($str =~ /&([^\s;]+)([\s;])/g) {

print "No ending semi-colon: $1\n" if $2 =~ tr/ \r\n\t//;
print "Invalid entity: $1\n" unless $entity{$1};
]]>
</desc>
</root>

Peroli Sivaprakasam
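
As a rough sketch of the CDATA idea, a script generating the XML could wrap free text like this. The sub name cdata_wrap is hypothetical, and splitting a literal ']]>' across two adjacent CDATA sections is a common workaround assumed here, not something proposed in the thread.

```perl
use strict;
use warnings;

# Wrap text in a CDATA section. A literal "]]>" cannot appear inside one,
# so it is split: "]]" ends the first section and ">" starts the next.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('Browning & Associates'), "\n";
```

This only helps when the files are being generated; it does not repair files that already contain unprotected ampersands in ordinary element content.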
 
Arndt Jonasson

Tad McClellan said:
That was fine.

Please don't use "cutsie" spellings (r=are, u=you ...),
it is inconsiderate of your readers.

One wonders, does that apply to using only lower-case letters too?
Though I do admit to finding those articles easier to read than ones
with the cute spellings.
 
Jürgen Exner

Peroli said:
[...]... If u r scared about parsing
& charecters as entity in an xml file, y dont u use a CDATA section.

Ok, English is not my native language, but this sentence fails my parsing
routines so badly that even the exception handler gives up. I can't even
guess what you might have meant.

jue
 
zeebster

I just wanted to know if my regexp syntax was correct.
Here is the situation in more detail. I have a stack of xml files that
were generated by a script that was good enough to protect the special
characters in most of the elements but left out a couple (there is a
filename element, for instance, that contains the full path to a file on
the user's hard drive, and the path contains unprotected ampersands). I am
trying to do a search and replace of all these ampersands, but I do not
want to match the ones that are in front of protected entities. For
example it should match Browning & Associates but not Browning &amp;
Associates. Of course the latter is the intended output after the
replace has finished. I ran the aforementioned regexp from the command
line using grep to test it and it returned 0 matches. This is what I
meant by it not working since I know there are dozens of these. I do
not yet have code to display since I was trying to understand the
regexp I should use first.
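
One possible explanation for the zero matches, offered as a guess since the exact command was not posted: grep's default basic and extended regular expressions have no lookahead, so (?!...) is not interpreted as one there (GNU grep only accepts Perl-style patterns with -P). In Perl itself the search and replace described above can be sketched as follows. The sub name protect_ampersands and the inclusion of apos in the entity list are assumptions for illustration.

```perl
use strict;
use warnings;

# Escape every '&' that is not already the start of one of the listed entities.
sub protect_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:amp|lt|gt|quot|apos);)/&amp;/g;
    return $text;
}

print protect_ampersands('Browning & Associates'), "\n";
print protect_ampersands('Browning &amp; Associates'), "\n";
```

Both lines come out as "Browning &amp; Associates", so running the substitution twice does no harm. Numeric character references such as &#38; are not in the list and would be escaped again; the alternation would need extending if the files contain them.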
 
xhoster

zeebster said:
I just wanted to know if my regexp syntax was correct.

Were the syntax not correct, Perl would have notified you of this fact
by reporting a syntax error.

....
I ran the aforementioned regexp from the command
line using grep to test it and it returned 0 matches.

So, you obviously used some code to do this, no? Where is that code?
How do we know that it is the regex that failed to do its job, as
opposed to the part of your code that responds (or is supposed to respond)
to the results returned by the regex?

This is what I
meant by it not working since I know there are dozens of these. I do
not yet have code to display since I was trying to understand the
regexp I should use first.

Obviously you do have code to display. You just choose not to display
it. No skin off my nose.

Xho
 
Peroli

Hi Arndt Jonasson,
My apologies for writing so unreadably. I will change my
style of writing. And for zeebster.... I'm sorry, I have no solutions
currently for your problem.

Peroli Sivaprakasam
 
Peroli

Hi Arndt Jonasson,
My apologies for writing so unreadably. I will change my
style of writing. And for zeebster.... I am sorry, I have no solutions
currently for your problem.

Peroli Sivaprakasam
 
