Thomas said:
If you want to parse many HTML pages, you can use tidy to create
XML and then use an XML parser. There are too many ways HTML can be
broken, including the page Anders pointed to, which is too broken for
tidy's default settings:
line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
You can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not careful,
you may end up with unusable Swedish text).
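The effect of that first warning can be sketched in a few lines of modern
Python 3 (a hypothetical illustration, not part of the attached script):
decoding UTF-8 bytes with the declared ISO-8859-1 encoding turns every
Swedish character into mojibake.

```python
# "två" (Swedish for "two") as it appears in the UTF-8 page source
data = "tv\u00e5".encode("utf-8")

# decoding with the declared (wrong) encoding mangles the text
wrong = data.decode("iso-8859-1")
print(wrong)  # tvÃ¥

# decoding with the actual encoding recovers it
right = data.decode("utf-8")
print(right)  # två
```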
I've attached a script based on my ElementTidy binding for tidy; see
alternative 1 below. Usage:
URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
wordlist = parse_microsoft_wordlist(URL)
for item in wordlist:
    print item
The wordlist contains (English word, Swedish word) tuples, using Unicode
where appropriate.
You can get elementtree and elementtidy via
http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm
On the other hand, for this specific case, a regular-expression-based
approach is probably easier; see alternative 2 below for one way to do it.
</F>
# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach
from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()
    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)
    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")
    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml.*?>", "", text) # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text) # <mshelp> stuff
    # now, let's process it
    tree = parse(StringIO(text))
    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist
# helpers
def XHTML(tag):
    # map a tag to its fully qualified XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag
def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub(r"^\d+\.\s+", "", word)
    return word
def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text
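As an aside, in current versions of ElementTree (bundled with Python 3 as
xml.etree), the flatten helper can be replaced by the built-in itertext
method; a minimal sketch:

```python
import xml.etree.ElementTree as ET

def flatten(node):
    # join the text of an element and all its subelements, in document order
    return "".join(node.itertext())

# a cell like the ones on the Microsoft page
td = ET.fromstring("<td>12. <b>adapter</b> card</td>")
print(flatten(td))  # 12. adapter card
```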
# --------------------------------------------------------------------
# alternative 2: using regular expressions
import re
from urllib import urlopen
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()
    text = unicode(text, "utf-8")
    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"
    def fixword(word):
        # get rid of leading "nnn. "
        word = re.sub(r"^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word
    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))
    return wordlist
# --------------------------------------------------------------------
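For reference, the regex approach translates almost line for line to
Python 3 (urllib.request instead of urllib, str instead of unicode). A
hedged sketch, split so the parsing works on any HTML string; the sample
row below is made up, since the original MSDN URL is long gone:

```python
import re

def parse_wordlist(text):
    # same pattern as above: capture the first two TD cells of each TR
    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"
    def fixword(word):
        word = re.sub(r"^\d+\.\s+", "", word)  # strip leading "nnn. "
        word = re.sub(r"<[^>]+>", "", word)    # strip embedded tags
        return word
    return [(fixword(w1), fixword(w2))
            for w1, w2 in re.findall(pattern, text)]

# hypothetical sample row in the same shape as the Microsoft page
sample = "<tr><td>1. <b>abort</b></td><td>avbryta</td><td>verb</td></tr>"
print(parse_wordlist(sample))  # [('abort', 'avbryta')]
```

To fetch a live page, feed it `urllib.request.urlopen(url).read().decode("utf-8")`.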