Help on regular expression match

Johnny Lee · Sep 23, 2005

Hi,
I've run into a problem matching a regular expression in Python. I hope
one of you can help me. Here are the details:

I have many tags like this:
xxx<a href="http://xxx.xxx.xxx" xxx>xxx
xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
xxx<a href="http://xxx.xxx.xxx" xxx>xxx
.....
I want to extract all the "http://xxx.xxx.xxx" URLs, so I do it
like this:
httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
result = httpPat.findall(data)
I use this to inspect the output:
for i in result:
    print i[2]
Surprisingly, I get some output like this:
http://xxx.xxx.xxx">xxx</a>xxx
It comes from this kind of source:
<a href="http://xxx.xxx.xxx">xxx</a>xxx"
Some of the results are right, though. How can I get all the
answers clean, like "http://xxx.xxx.xxx"? Thanks for your help.

Regards,
Johnny

Fredrik Lundh · Sep 23, 2005


".*" gives the longest possible match (you can think of it as searching
backwards from the right end). if you want to search for "everything until
a given character", searching for "[^x]*x" is often a better choice than ".*x".
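
A minimal sketch of that difference, assuming Python 3 and a made-up sample line (the `a.example` and `b.example` addresses are placeholders, not from the original data):

```python
import re

# Hypothetical sample: two anchors on one line, so a greedy match can overrun.
data = '<a href="http://a.example">one</a> <a href="wap://b.example">two</a>'

# Greedy: ".*" runs on to the LAST double quote on the line.
greedy = re.findall(r'href="(http://.*)"', data)

# Negated class: '[^"]*' cannot cross a quote, so it stops at the first one.
bounded = re.findall(r'href="(http://[^"]*)"', data)

print(greedy)   # one overlong match that swallows the second tag
print(bounded)  # only the clean http:// URL
```

The negated character class also keeps the wap:// link out of the result, because the match can no longer spill over from the first tag into the second.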

in this case, I suggest using something like

print re.findall("href=\"([^\"]+)\"", text)

or, if you're going to parse HTML pages from many different sources, a
real parser:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    print value

p = MyHTMLParser()
p.feed(text)
p.close()
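
For readers on Python 3, where this module lives at `html.parser`, a rough equivalent might look like the sketch below. The class name `LinkCollector`, the `links` attribute, and the sample `text` are illustrative assumptions; the filter on "http://" reflects the original question rather than anything the parser requires.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class LinkCollector(HTMLParser):
    """Collect http:// href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                # value is None for attributes written without a value
                if key == "href" and value and value.startswith("http://"):
                    self.links.append(value)


# Made-up sample input with one http:// link and one wap:// link.
text = '<a href="http://a.example" id="x">one</a> <a href="wap://b.example">two</a>'

collector = LinkCollector()
collector.feed(text)
collector.close()
print(collector.links)
```

Collecting into a list instead of printing makes the result easier to filter or reuse later.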

see:

http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/htmlparser-example.html
http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html

</F>

Johnny Lee · Sep 23, 2005

Thanks for your help.
I found another solution: simply adding a '?' after ".*", which
makes it search for the shortest string that matches the regular
expression.
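
A small sketch of the non-greedy variant, assuming Python 3 and a made-up sample line (the URLs are placeholders):

```python
import re

# Hypothetical sample: two http:// anchors on the same line.
data = '<a href="http://a.example" id="x">one</a> <a href="http://b.example">two</a>'

# Non-greedy: ".*?" expands only as far as needed to reach the next quote.
lazy = re.findall(r'href="(http://.*?)"', data)

# Negated class, for comparison: it cannot cross a quote at all.
bounded = re.findall(r'href="(http://[^"]*)"', data)

print(lazy)
print(bounded)
```

Here both patterns give the same list, because the pattern ends right after the closing quote. In longer patterns a non-greedy ".*?" can still run past a quote if that is the only way for the rest of the pattern to match, which the negated class never does.
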
As for the HTMLParser, there is another problem (take my code for example):

import urllib
import htmllib
import formatter
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib.urlopen(baseUrl).read())
parser.close()
for url in parser.anchorlist:
    if url[0:7] == "http://":
        print url

When baseUrl="http://www.nba.com", it raises an
HTMLParseError because of the line "<! Copyright IBM Corporation,
2001, 2002 !>". I found that this line is inside <script> tags;
maybe that's the cause?

John J. Lee · Sep 24, 2005

It's worth noting that module HTMLParser is less tolerant of the bad
HTML you find in the real world than is module sgmllib, which has a
similar interface. There are also third party libraries like
BeautifulSoup and mxTidy that you may find useful for parsing "HTML as
deployed" (i.e. bad HTML, often).

Also, htmllib is an extension to sgmllib, and will do your link
parsing with even less effort:

import htmllib, formatter, urllib2
pp = htmllib.HTMLParser(formatter.NullFormatter())
pp.feed(urllib2.urlopen("http://python.org/").read())
print pp.anchorlist

Module HTMLParser does have better support for XHTML, though.

John

John J. Lee · Sep 24, 2005

Johnny Lee said:
When baseUrl="http://www.nba.com", it raises an
HTMLParseError because of the line "<! Copyright IBM Corporation,
2001, 2002 !>". I found that this line is inside <script> tags;
maybe that's the cause?

No, it's because they're using a broken HTML comment (should be
"<!-- Copyright IBM Corporation, 2001, 2002 -->"). BeautifulSoup is more tolerant:

import urllib2
from BeautifulSoup import BeautifulSoup
bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
for el in bs.fetch('a'):
    print el['href']

Or you could pre-process the HTML using mxTidy, and carry on using
module htmllib.

Hmm, are you the same Johnny Lee who contributed the MSIE cookie
support to LWP?

John
