Parsing HTML


Thomas Guettler

On Thu, 23 Sep 2004 08:42:08 +0200, Anders Eriksson wrote:
Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?

Hi,

If you only want to parse one page, I would use the re module.

If you want to parse many HTML pages, you can use tidy to create
xml and then use an xml parser. There are too many ways HTML can be
broken.
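
A minimal sketch of that pipeline, assuming the command-line tidy tool is
installed (the -asxml switch converts the page to XHTML; tidy's warnings go
to stderr):

import os
from xml.dom import minidom

def html_to_dom(filename):
    # -q keeps tidy quiet, -asxml makes it emit well-formed XHTML
    pipe = os.popen("tidy -q -asxml %s" % filename)
    return minidom.parseString(pipe.read())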

HTH,
Thomas
 

Richie Hindle

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is perfect for
this job:

import urllib2, pprint
from BeautifulSoup import BeautifulSoup

def cellToWord(cell):
    """Given a table cell, return the word in that cell."""
    # Some words are in bold.
    if cell('b'):
        return cell.first('b').string.strip()  # Return the bold piece.
    else:
        return cell.string.split('.')[1].strip()  # Remove the number.

def parse(url):
    """Parse the given URL and return a dictionary mapping US words to
    foreign words."""

    # Read the URL and pass it to BeautifulSoup.
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup()
    soup.feed(html)

    # Read the main table, extracting the words from the table cells.
    USToForeign = {}
    mainTable = soup.first('table')
    rows = mainTable('tr')
    for row in rows[1:]:  # Exclude the first (headings) row.
        cells = row('td')
        if len(cells) == 3:  # Some rows have a single colspan="3" cell.
            US = cellToWord(cells[0])
            foreign = cellToWord(cells[1])
            USToForeign[US] = foreign

    return USToForeign


if __name__ == '__main__':
    url = 'http://msdn.microsoft.com/library/en-us/dnwue/html/FRE_word_list.htm'
    USToForeign = parse(url)
    pairs = USToForeign.items()
    pairs.sort(lambda a, b: cmp(a[0].lower(), b[0].lower()))  # Web page order
    pprint.pprint(pairs)
 

Richie Hindle

[Richie]
BeautifulSoup is perfect for this job:

Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
And I don't know whether you'd consider it correct to extract only the bold
text from the entries that have bold text. But it gives you a place to start.
:cool:
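
One simple workaround (an untested sketch, not part of the script above):
strip the empty bold pairs out of the HTML between the read() and feed()
calls, so that cellToWord never sees a bold element with no text in it:

import re
# untested sketch: drop empty <b></b> pairs before parsing
html = re.sub('(?i)<b>\s*</b>', '', html)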
 

Chris McD

Anders said:
Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?

Please help!

// Anders

hi,
try this:

###############################################
import re, urllib2

# get the page
url = 'http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm'
s = urllib2.urlopen(url).read()

# each cell looks like "<td>12. <b>word</b></td>" or "<td>12. word</td>"
regex = re.compile('<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')
myresult = regex.findall(s)
#print myresult

# map pairs in the list to key:value in a dict
mydict = {}
for i in range(0, len(myresult) - 1, 2):
    mydict[myresult[i]] = myresult[i + 1]

#print mydict

# try some words
print mydict['wizard']
print mydict['Web site']
print mydict['unavailable']

##############################

which outputs:
guide
webbplats
inte tillgänglig
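
The same pairing can also be written with slicing, which pairs the
even-indexed (English) and odd-indexed (Swedish) entries directly; a
one-line sketch of the same mapping:

mydict = dict(zip(myresult[::2], myresult[1::2]))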



Chris
 

Walter Dörwald

Richie said:
[Richie]
BeautifulSoup is perfect for this job:


Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
And I don't know whether you'd consider it correct to extract only the bold
text from the entries that have bold text. But it gives you a place to start.
:cool:

Another option might be the HTML parser from libxml2 (www.xmlsoft.org):
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid
element name
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...
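
For completeness, a minimal sketch of that route (assuming the libxml2
Python bindings are installed; the parser recovers from broken markup, but
prints errors like the one above to stderr):

import libxml2

doc = libxml2.htmlParseFile(
    "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm",
    None)
ctxt = doc.xpathNewContext()
for cell in ctxt.xpathEval("//tr/td"):
    print cell.content  # text content of each table cell
ctxt.xpathFreeContext()
doc.freeDoc()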

Bye,
Walter Dörwald
 

Fredrik Lundh

Thomas said:
If you want to parse many HTML pages, you can use tidy to create
xml and then use an xml parser. There are too many ways HTML can be
broken.

including the page Anders pointed to, which is too broken for tidy's
default settings:

line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

you can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not
careful, you may end up with unusable swedish text).
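
one guess at such settings (option names from the tidy documentation, not
verified against this page): tell tidy about the real input encoding and
force it to emit output despite the errors:

import os
xhtml = os.popen(
    "tidy -q -asxml --force-output yes --input-encoding utf8 "
    "--output-encoding latin1 swe_word_list.htm").read()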

I've attached a script based on my ElementTidy binding for tidy. see
alternative 1 below. usage:

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"

wordlist = parse_microsoft_wordlist(URL)

for item in wordlist:
    print item

the wordlist contains (english word, swedish word), using Unicode where
appropriate.

you can get elementtree and elementtidy via

http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm

on the other hand, for this specific case, a regular expression-based approach
is probably easier. see alternative 2 below for one way to do it.

</F>

# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach

from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re

def parse_microsoft_wordlist(url):

    text = urlopen(url).read()

    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)

    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")

    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml.*?>", "", text)  # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text)   # <mshelp> stuff

    # now, let's process it
    tree = parse(StringIO(text))

    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist

# helpers

def XHTML(tag):
    # map a tag to its fully qualified XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag

def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub("^\d+\.\s+", "", word)
    return word

def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text

# --------------------------------------------------------------------
# alternative 2: using regular expressions

import re
from urllib import urlopen

def parse_microsoft_wordlist(url):

    text = urlopen(url).read()
    text = unicode(text, "utf-8")

    pattern = "(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"

    def fixword(word):
        # get rid of the leading "nnn. "
        word = re.sub("^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word

    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))

    return wordlist

# --------------------------------------------------------------------
 

Anders Eriksson

I would like to thank everyone who has helped with this!

The solution I settled on was using BeautifulSoup and a script that Mr.
Leonard Richardson sent me.

Now to the next part of the problem: how to manage Unicode.... (one
possible approach is sketched after the script below)


// Anders
--
To promote the usage of BeautifulSoup, here is the script by Mr. Leonard
Richardson:

import urllib
import re
from BeautifulSoup import BeautifulSoup

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
text = urllib.urlopen(URL).read()
# remove all <b> and </b> tags
p = re.compile('<b>|</b>')
text = p.sub('', text)

# soupify it
soup = BeautifulSoup(text)

def unmunge(value):
    """Use this method to turn, eg "74. <b>Help</b> menu" into "Help menu",
    probably using a regular expression."""
    return value[value.find('.')+2:]

d = []
cols = soup.fetch('td', {'width': '33%'})
for i in range(0, len(cols)):
    if i % 3 != 2:  # Every third column is a note, which we ignore.
        value = unmunge(cols[i].renderContents())
        if not d or len(d[-1]) == 2:
            # English term
            d.append([value])
        else:
            # Swedish term
            d[-1].append(value)
d = dict(d)
for key, val in d.items():
    print "%s = %s" % (key, val)
 

Uche Ogbuji

Anders Eriksson said:
Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?

http://www.xml.com/pub/a/2004/09/08/pyxml.html

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com

A hands-on introduction to ISO Schematron -
http://www-106.ibm.com/developerworks/edu/x-dw-xschematron-i.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Principles of XML design: Considering container elements -
http://www-106.ibm.com/developerworks/xml/library/x-contain.html
Hacking XML Hacks - http://www-106.ibm.com/developerworks/xml/library/x-think26.html
A survey of XML standards -
http://www-106.ibm.com/developerworks/xml/library/x-stand4/
 
