Thomas said:
If you want to parse many HTML pages, you can use tidy to create
XML and then use an XML parser. There are too many ways HTML can be
broken, including the page Anders pointed to, which is too broken for
tidy's default settings:
line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
You can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not careful,
you may end up with unusable Swedish text).
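The effect of that first warning can be sketched in a few lines of modern
Python 3 (a hypothetical illustration, not part of the attached script):
decoding UTF-8 bytes with the declared ISO-8859-1 encoding turns every
Swedish character into mojibake.

```python
# "två" (Swedish for "two") as it appears in the UTF-8 page source
data = "tv\u00e5".encode("utf-8")

# decoding with the declared (wrong) encoding mangles the text
wrong = data.decode("iso-8859-1")
print(wrong)  # tvÃ¥

# decoding with the actual encoding recovers it
right = data.decode("utf-8")
print(right)  # två
```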
I've attached a script based on my ElementTidy binding for tidy; see
alternative 1 below. Usage:
URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
wordlist = parse_microsoft_wordlist(URL)
for item in wordlist:
    print item
The wordlist contains (English word, Swedish word) tuples, using Unicode
where appropriate.
You can get elementtree and elementtidy via
http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm
On the other hand, for this specific case, a regular-expression-based
approach is probably easier; see alternative 2 below for one way to do it.
</F>
# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach
from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()
    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)
    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")
    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml.*?>", "", text) # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text) # <mshelp> stuff
    # now, let's process it
    tree = parse(StringIO(text))
    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist
# helpers
def XHTML(tag):
    # map a tag to its fully qualified XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag
def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub(r"^\d+\.\s+", "", word)
    return word
def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text
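As an aside, in current versions of ElementTree (bundled with Python 3 as
xml.etree), the flatten helper can be replaced by the built-in itertext
method; a minimal sketch:

```python
import xml.etree.ElementTree as ET

def flatten(node):
    # join the text of an element and all its subelements, in document order
    return "".join(node.itertext())

# a cell like the ones on the Microsoft page
td = ET.fromstring("<td>12. <b>adapter</b> card</td>")
print(flatten(td))  # 12. adapter card
```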
# --------------------------------------------------------------------
# alternative 2: using regular expressions
import re
from urllib import urlopen
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()
    text = unicode(text, "utf-8")
    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"
    def fixword(word):
        # get rid of leading "nnn. "
        word = re.sub(r"^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word
    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))
    return wordlist
# --------------------------------------------------------------------
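For reference, the regex approach translates almost line for line to
Python 3 (urllib.request instead of urllib, str instead of unicode). A
hedged sketch, split so the parsing works on any HTML string; the sample
row below is made up, since the original MSDN URL is long gone:

```python
import re

def parse_wordlist(text):
    # same pattern as above: capture the first two TD cells of each TR
    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"
    def fixword(word):
        word = re.sub(r"^\d+\.\s+", "", word)  # strip leading "nnn. "
        word = re.sub(r"<[^>]+>", "", word)    # strip embedded tags
        return word
    return [(fixword(w1), fixword(w2))
            for w1, w2 in re.findall(pattern, text)]

# hypothetical sample row in the same shape as the Microsoft page
sample = "<tr><td>1. <b>abort</b></td><td>avbryta</td><td>verb</td></tr>"
print(parse_wordlist(sample))  # [('abort', 'avbryta')]
```

To fetch a live page, feed it `urllib.request.urlopen(url).read().decode("utf-8")`.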