Help on regular expression match

Johnny Lee

Hi,
I've run into a problem matching a regular expression in Python. I hope
one of you can help. Here are the details:

I have many tags like this:
xxx<a href="http://xxx.xxx.xxx" xxx>xxx
xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
xxx<a href="http://xxx.xxx.xxx" xxx>xxx
.....
I want to extract every "http://xxx.xxx.xxx" URL, so I do this:
httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
result = httpPat.findall(data)
I use this to inspect the output:
for i in result:
    print i[2]
Surprisingly, some of the output looks like this:
http://xxx.xxx.xxx">xxx</a>xxx
It comes from source like this:
<a href="http://xxx.xxx.xxx">xxx</a>xxx"
Some of the results are correct, though. How can I get every result as a
clean "http://xxx.xxx.xxx"? Thanks for your help.


Regards,
Johnny
 
Fredrik Lundh


".*" gives the longest possible match (you can think of it as searching backwards
from the right end). if you want to search for "everything until a given
character", searching for "[^x]*x" is often a better choice than ".*x".
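
The difference is easy to see on a tag that carries two quoted attributes. What follows is only a small sketch, written for Python 3 (the thread itself uses Python 2 print statements); the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample line: two double-quoted attribute values in one tag.
data = '<a href="http://a.example/x" class="nav">home</a>'

# Greedy: ".*" runs to the end of the line, then backs up to the LAST quote.
greedy = re.findall(r'href="(.*)"', data)

# Negated class: '[^"]*' cannot cross a quote, so it stops at the FIRST one.
bounded = re.findall(r'href="([^"]*)"', data)

print(greedy)   # ['http://a.example/x" class="nav']
print(bounded)  # ['http://a.example/x']
```

The negated class also avoids the backtracking the greedy version does, since it never has to give characters back.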

in this case, I suggest using something like

print re.findall("href=\"([^\"]+)\"", text)

or, if you're going to parse HTML pages from many different sources, a
real parser:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    print value

p = MyHTMLParser()
p.feed(text)
p.close()
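
For readers on Python 3, where this module lives at html.parser, roughly the same idea might look like the sketch below. The class name LinkCollector, the links attribute, and the sample text are assumptions made for illustration, not part of the original post.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser module


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    self.links.append(value)


# Made-up sample text in the shape of the question's data.
text = (
    'xxx<a href="http://one.example" id="a">xxx</a>\n'
    'xxx<a href="wap://two.example">xxx</a>\n'
)

collector = LinkCollector()
collector.feed(text)
collector.close()

# Keep only the http:// links, as the original question wanted.
http_links = [url for url in collector.links if url.startswith("http://")]
print(http_links)  # ['http://one.example']
```

Collecting into a list rather than printing inside the handler makes the later filtering step a plain list comprehension.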

see:

http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/htmlparser-example.html
http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html

</F>
 
Johnny Lee

Thanks for your help.
I found another solution: adding a '?' after ".*" makes it non-greedy, so it
matches as few characters as possible while still satisfying the rest of the
expression.
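
A quick way to check the effect of the '?' is to run both patterns over the same input. This is a sketch for Python 3 with invented sample data; the pattern is a simplified form of the one in the question, with the extra groups dropped.

```python
import re

# Hypothetical input modelled on the tags in the question.
data = (
    'xxx<a href="http://one.example" id="a">xxx</a>xxx"\n'
    'xxx<a href="wap://two.example" id="b">xxx\n'
    'xxx<a href="http://three.example">xxx</a>\n'
)

# Greedy ".*" runs to the last quote on each line.
greedy = re.compile(r'<a href="(http://.*)"')
# Non-greedy ".*?" stops at the first closing quote.
lazy = re.compile(r'<a href="(http://.*?)"')

greedy_urls = greedy.findall(data)
lazy_urls = lazy.findall(data)

print(greedy_urls)  # first entry drags in everything up to the trailing quote
print(lazy_urls)    # ['http://one.example', 'http://three.example']
```

The third line comes out right even with the greedy pattern because it has only one closing quote after the URL, which matches the "some results are right" symptom in the question.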
As for HTMLParser, I ran into another problem (taking my code as an example):

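import urllib
import formatter
import htmllib  # missing from the original snippet

# baseUrl is assumed to be set earlier, e.g. "http://www.nba.com"
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib.urlopen(baseUrl).read())
parser.close()
for url in parser.anchorlist:
    # keep only absolute http links
    if url[0:7] == "http://":
        print url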

With baseUrl = "http://www.nba.com", this raises an
HTMLParseError because of the line "<! Copyright IBM Corporation,
2001, 2002 !>". That line sits inside <script> tags;
is that the cause?
 
John J. Lee

Fredrik Lundh said:
or, if you're going to parse HTML pages from many different sources, a
real parser: [...]

It's worth noting that module HTMLParser is less tolerant of the bad
HTML you find in the real world than is module sgmllib, which has a
similar interface. There are also third-party libraries like
BeautifulSoup and mxTidy that you may find useful for parsing "HTML as
deployed" (i.e., often bad HTML).

Also, htmllib is an extension to sgmllib, and will do your link
parsing with even less effort:

import htmllib, formatter, urllib2
pp = htmllib.HTMLParser(formatter.NullFormatter())
pp.feed(urllib2.urlopen("http://python.org/").read())
print pp.anchorlist


Module HTMLParser does have better support for XHTML, though.


John
 
John J. Lee

Johnny Lee said:
[...]
With baseUrl = "http://www.nba.com", this raises an
HTMLParseError because of the line "<! Copyright IBM Corporation,
2001, 2002 !>". That line sits inside <script> tags;
is that the cause?

No, it's because they're using a broken HTML comment (should be
"<!--comment-->"). BeautifulSoup is more tolerant:

import urllib2
from BeautifulSoup import BeautifulSoup
bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
for el in bs.fetch('a'):
    print el['href']


Or you could pre-process the HTML using mxTidy, and carry on using
module htmllib.
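
For comparison only, and not something the posters used: in current Python 3 releases the standard library's html.parser is lenient about this kind of malformed declaration and does not raise on it. The sketch below assumes Python 3; the class name AnchorList, the sample page, and the relative /scores link are invented for illustration.

```python
from html.parser import HTMLParser


class AnchorList(HTMLParser):
    """Gather href values, loosely mirroring htmllib's anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.anchorlist.append(href)


# Made-up page fragment containing the malformed declaration from the
# thread, once inside <script> and once in the body.
page = (
    "<html><head><script>\n"
    "<! Copyright IBM Corporation, 2001, 2002 !>\n"
    "</script></head><body>\n"
    "<! Copyright IBM Corporation, 2001, 2002 !>\n"
    '<a href="http://www.nba.com/teams">Teams</a>\n'
    '<a href="/scores">Scores</a>\n'
    "</body></html>"
)

parser = AnchorList()
parser.feed(page)
parser.close()

# Same filter as the original loop: keep only absolute http links.
absolute = [url for url in parser.anchorlist if url.startswith("http://")]
print(absolute)  # ['http://www.nba.com/teams']
```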

Hmm, are you the same Johnny Lee who contributed the MSIE cookie
support to LWP?


John
 
