Regexp: Negation with backreference?

j.vimal

Hi
I would like to extract the anchors from a page. This is the simple
pattern I wrote:
/(<[aA]\\s[^>]*>[^<]*<\/a>)/

Note that it is to be used with a programming language, say PHP, but
the syntax is (almost) the same as Perl's except for escape sequences.

Now, after I have got all the anchors, I want to parse them, to get the
href and title attributes.
For the href, I wrote

\\bhref\\s*=\\s*(["'])([^\\1])\\1

I search for href at a word boundary, then skip spaces,
then the equals sign, then skip spaces, and then I capture the quote. This is
reference 1. Now I want to continue until I encounter the same
reference 1 again. Then the last character is again reference 1.

So, is this syntax right? It doesn't seem to work for me ...

And, of course, the quotes need not be there. I will change it :)

Thanks!
 
Paul Lalli

j.vimal said:
Hi
I would like to extract the anchors from a page. This is the simple
pattern I wrote:
/(<[aA]\\s[^>]*>[^<]*<\/a>)/

Wrong approach. Use an HTML parsing module to parse HTML.

j.vimal said:
Note that it is to be used with a programming language, say PHP, but
the syntax is (almost) the same as Perl's except for escape sequences.

Wow, coincidentally, this is almost a group that deals with languages
other than Perl!

comp.lang.php is over there---->

Paul Lalli
 
j.vimal

Ok ... But say I really want to do it this way, to learn Regexp, :)
Then ?

But why do you say that this is a wrong way? Are there performance
issues?
 
Paul Lalli

j.vimal said:
Ok ... But say I really want to do it this way, to learn Regexp, :)

There is no such thing. Regexps are not a universal concept. You cannot
take one regular expression written for Perl and just assume it will work
the same way in any other language.

j.vimal said:
Then ?

But why do you say that this is a wrong way? Are there performance
issues?

No, there are ability issues. Regular expressions cannot (correctly)
parse HTML.

Paul Lalli
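
To illustrate the point about ability rather than performance, here is a minimal sketch in Python (an assumption made for illustration; the thread itself targets PHP and Perl). It runs the anchor pattern from the first post next to Python's standard html.parser on a small made-up sample. The names ANCHOR_RE, AnchorCollector, collect_anchors and SAMPLE are hypothetical.

```python
import re
from html.parser import HTMLParser

# The anchor pattern from the first post, written with single backslashes.
ANCHOR_RE = re.compile(r'(<[aA]\s[^>]*>[^<]*</a>)')


class AnchorCollector(HTMLParser):
    """Collects (href, title) pairs from every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and strips the attribute quotes.
        if tag == "a":
            attributes = dict(attrs)
            self.anchors.append((attributes.get("href"), attributes.get("title")))


def collect_anchors(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Made-up sample: the second anchor wraps its text in a nested <b> tag.
SAMPLE = (
    '<p><a href="/wiki/Perl" title="Perl">Perl</a> and '
    "<a href='/wiki/PHP'><b>PHP</b></a></p>"
)
```

On this sample the regular expression finds only the first anchor, because [^<]* cannot cross the nested <b> tag, while the parser reports both anchors along with their attributes.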
 
j.vimal

Ok. Then I think, for my purpose, it suits.
My purpose is just to visualize the various links in a given Wikipedia
article. Since they follow a common method to address their links,
regular expressions would serve my purpose without the overhead of an
HTML parser :)

Thanks
Vimal
 
Xicheng Jia

j.vimal said:
Hi
I would like to extract the anchors from a page. This is the simple
pattern I wrote:
/(<[aA]\\s[^>]*>[^<]*<\/a>)/

Note that it is to be used with a programming language, say PHP, but
the syntax is (almost) the same as Perl's except for escape sequences.

Now, after I have got all the anchors, I want to parse them, to get the
href and title attributes.
For the href, I wrote

\\bhref\\s*=\\s*(["'])([^\\1])\\1

This pattern matches only one character between the two quotes, so $2
holds a single character. :)

And I guess [^\\1] does not work the way you expected, i.e. as [^"] or
[^']: a backreference is not recognized inside a character class. You can
try the non-greedy form .*? instead, which grows one character at a time
and stops at the first following \1:

\bhref\s*=\s*(["'])(.*?)\1
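
A minimal sketch of the difference, in Python rather than PHP or Perl (an assumption made for illustration). Python's re module reads \1 inside a character class as the octal escape for character code 1, not as a backreference, so the first pattern accepts exactly one character between the quotes. The names BROKEN, FIXED and extract_href are hypothetical.

```python
import re

# The pattern from the question: [^\1] means "any one character except
# character code 1", so only a one-character value can match.
BROKEN = re.compile(r'''\bhref\s*=\s*(["'])([^\1])\1''')

# The non-greedy form: .*? grows one character at a time until the
# backreference \1 (the opening quote) matches again.
FIXED = re.compile(r'''\bhref\s*=\s*(["'])(.*?)\1''')


def extract_href(tag):
    """Return the quoted href value found in tag, or None if there is none."""
    match = FIXED.search(tag)
    return match.group(2) if match else None
```

For example, extract_href('<a href="/wiki/Perl" title="Perl">') returns '/wiki/Perl', whereas BROKEN.search() on the same tag finds nothing. Because the closing quote must be the same character as the opening one, a value such as 'say "hi"' in single quotes is also returned whole.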

Or you may use a conditional construct if the surrounding quotes are
optional:

\bhref\s*=\s*(["'])?(.*?)(?(1)\1|\s)
(untested)
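
Python's re module also supports the (?(1)yes|no) conditional, so the pattern above can be tried out with a short sketch (Python is an assumption here, and CONDITIONAL and href_value are hypothetical names). If group 1 captured a quote, the value must end at the same quote; otherwise it ends at the first whitespace character.

```python
import re

# Group 1 is an optional opening quote; the conditional closes the value
# with that same quote if one was captured, else with whitespace.
CONDITIONAL = re.compile(r'''\bhref\s*=\s*(["'])?(.*?)(?(1)\1|\s)''')


def href_value(tag):
    """Return the href value from tag, quoted or unquoted, or None."""
    match = CONDITIONAL.search(tag)
    return match.group(2) if match else None
```

One limitation of the pattern as posted: an unquoted value followed directly by > (for example <a href=/wiki/Perl>) is not matched, because the unquoted branch requires a whitespace character after the value.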

BTW, why would you use double backslashes to escape those special
characters?

Xicheng
 
