Special chars with HTMLParser

Fafounet · Aug 5, 2009

Hello,

I am parsing a web page with special chars such as &#xe9; (which
stands for é).
I know I can get the unicode character é from
unicode("\xe9", "iso-8859-1"),
but with those character references I don't know how.

I tried to implement handle_charref within HTMLParser without success.
Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
handle_charref will get xe9 and then handle_data doesn't have the end
of the string ("cd").

Thank you for your help,
Fabien

Piet van Oostrum · Aug 5, 2009

Fafounet said:
F> Hello,
F> I am parsing a web page with special chars such as &#xe9; (which
F> stands for é).
F> I know I can get the unicode character é from
F> unicode("\xe9", "iso-8859-1"),
F> but with those character references I don't know how.

F> I tried to implement handle_charref within HTMLParser without success.
F> Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
F> handle_charref will get xe9 and then handle_data doesn't have the end
F> of the string ("cd").

The character references indicate Unicode code points, not iso-8859-1
characters. In your example it gives the proper character because
iso-8859-1 coincides with the first 256 Unicode code points, but
for characters outside iso-8859-1 it will fail.

This should give you an idea:

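from htmlentitydefs import name2codepoint
....
    # Methods of your HTMLParser subclass (Python 2).
    def handle_charref(self, name):
        # name is e.g. 'xe9' (hexadecimal) or '233' (decimal)
        if name.startswith('x'):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        print 'char:', repr(unichr(num))

    def handle_entityref(self, name):
        # name is e.g. 'eacute'; look up its Unicode code point
        print 'char:', unichr(name2codepoint[name])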

If your HTML may be illegal you should add some exception handling.

Fafounet · Aug 5, 2009

Thank you, now I can get the correct character.

Now when I have the string ab&#xe9;cd I can get ab, then é thanks to
your function, and then cd. But how is it possible to know that cd is
still part of the same word?

Fabien

Piet van Oostrum · Aug 5, 2009

Fafounet said:
F> Thank you, now I can get the correct character.
F> Now when I have the string ab&#xe9;cd I can get ab, then é thanks to
F> your function, and then cd. But how is it possible to know that cd is
F> still part of the same word?

That depends on your definition of `word'. And that is
language-dependent.

What you normally do is collect the text in a (unicode) string variable.
This happens in handle_data, handle_charref and handle_entityref.
Then you check that the previously collected stuff was a word (e.g.
consisting of Unicode letters), and that the new stuff also consists of
letters. If your language has additional word constituents like - or '
you have to add this.

You can do this with unicodedata.category or with a regular
expression. If your locale is correct \w in a regular expression may be
helpful.
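
The approach described above could be sketched roughly like this in Python 3. The helper names is_word_char and split_words, the extra parameter and the sample pieces are all assumptions made for the example. It treats any character whose Unicode general category starts with "L" as a letter, plus a configurable set of additional word constituents. The last line shows the regular-expression variant; in Python 3, \w on str patterns is Unicode-aware without any locale setup.

```python
import re
import unicodedata


def is_word_char(ch, extra="-'"):
    # Unicode letters have a general category starting with 'L'
    # (Lu, Ll, Lt, Lm, Lo); 'extra' holds additional word constituents.
    return unicodedata.category(ch).startswith('L') or ch in extra


def split_words(text):
    """Split text into runs of word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append(''.join(current))
            current = []
    if current:
        words.append(''.join(current))
    return words


# Pieces as they might arrive from handle_data, handle_charref
# and handle_entityref, collected into one string first.
pieces = ['ab', '\xe9', 'cd', ' est l', "'", 'exemple']
text = ''.join(pieces)

print(split_words(text))
print(re.findall(r'\w+', text))
```

Because the pieces are joined before the word check, it no longer matters that the parser delivered "ab", "é" and "cd" in three separate calls. Note that \w also matches digits and the underscore, and does not treat ' as part of a word, so the two variants can give different results.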

Stefan Behnel · Aug 7, 2009

Fafounet said:
I am parsing a web page with special chars such as &#xe9; (which
stands for é).
I know I can get the unicode character é from
unicode("\xe9", "iso-8859-1"),
but with those character references I don't know how.

I tried to implement handle_charref within HTMLParser without success.
Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
handle_charref will get xe9 and then handle_data doesn't have the end
of the string ("cd").

Any reason you can't use a tree based HTML parser like the one in
lxml.html? That would eliminate this kind of problem altogether, as you'd
always get a well-decoded unicode string from the tree content.

Stefan
