Special chars with HTMLParser

Fafounet

Hello,

I am parsing a web page with special chars such as &#xe9; (which
stands for é).
I know I can get the unicode character é from
unicode("\xe9", "iso-8859-1")
but with those extra characters I don't know what to do.

I tried to implement handle_charref within HTMLParser without success.
Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
handle_charref will get xe9 and then handle_data doesn't have the end
of the string ("cd").

Thank you for your help,
Fabien
 
Piet van Oostrum

Fafounet said:
F> Hello,
F> I am parsing a web page with special chars such as &#xe9; (which
F> stands for é).
F> I know I can have the unicode character é from unicode
F> ("\xe9","iso-8859-1")
F> but with those extra characters I don't know.
F> I tried to implement handle_charref within HTMLParser without success.
F> Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
F> handle_charref will get xe9 and then handle_data doesn't have the end
F> of the string ("cd").

The character references indicate Unicode ordinals (code points), not
iso-8859-1 characters. In your example it will give the proper character
because iso-8859-1 coincides with the first 256 Unicode ordinals, but
for characters outside of iso-8859-1 it will fail.
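
To make the distinction concrete, here is a small sketch. It is written in Python 3 syntax, which is an assumption on my part (the thread itself uses Python 2, where chr would be unichr). It compares the two readings for é and for the euro sign, whose ordinal 0x20AC does not fit in a single iso-8859-1 byte.

```python
# A character reference such as &#xe9; names a Unicode code point.
code_point = 0xE9
from_ordinal = chr(code_point)
from_latin1 = bytes([code_point]).decode("iso-8859-1")

# Below 0x100 both readings agree, because iso-8859-1 maps each byte
# to the code point with the same number.
same_for_e_acute = from_ordinal == from_latin1

# &#x20ac; (the euro sign) lies outside iso-8859-1: the ordinal does not
# fit in one byte, so the "decode a byte" approach cannot work.
euro = chr(0x20AC)
try:
    bytes([0x20AC])
    euro_fits_in_one_byte = True
except ValueError:
    euro_fits_in_one_byte = False
```

So converting the ordinal directly works for every reference, while going through iso-8859-1 only works for the first 256.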

This should give you an idea:

from htmlentitydefs import name2codepoint
....
    def handle_charref(self, name):
        # name is the text between "&#" and ";", e.g. "xe9" or "233"
        if name.startswith(('x', 'X')):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        print 'char:', repr(unichr(num))

    def handle_entityref(self, name):
        # name is the entity name, e.g. "eacute"
        print 'char:', unichr(name2codepoint[name])

If your HTML may be illegal you should add some exception handling.
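
As a rough illustration of that exception handling, here is a self-contained sketch. It assumes Python 3, where the modules are named html.parser and html.entities and unichr is spelled chr. The class name CharCollector and the sample input are made up for the example. Note that Python 3 decodes references automatically unless convert_charrefs=False is passed, so the sketch switches that off to keep the two handlers in play.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CharCollector(HTMLParser):
    """Hypothetical parser that records each decoded reference."""

    def __init__(self):
        # With the Python 3 default (convert_charrefs=True) the two
        # handlers below would never be called.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_charref(self, name):
        try:
            if name.startswith(("x", "X")):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.chars.append(chr(num))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: keep the raw text.
            self.chars.append("&#" + name + ";")

    def handle_entityref(self, name):
        try:
            self.chars.append(chr(name2codepoint[name]))
        except KeyError:
            # Unknown entity name: keep the raw text.
            self.chars.append("&" + name + ";")


parser = CharCollector()
parser.feed("ab&#xe9;cd &eacute; &#8364; &bogus; &#1114112;")
parser.close()
print(parser.chars)
```

Keeping the raw text for a bad reference is only one possible policy; dropping it or substituting U+FFFD would be equally defensible.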
 
Fafounet

Thank you, now I can get the correct character.

Now when I have the string ab&#xe9;cd I can get ab, then é thanks to
your function, and then cd. But how is it possible to know that cd is
still part of the same word?


Fabien
 
Piet van Oostrum

Fafounet said:
F> Thank you, now I can get the correct character.
F> Now when I have the string ab&#xe9;cd I can get ab then é thanks to
F> your function and then cd. But how is it possible to know that cd is
F> still the same word ?

That depends on your definition of `word'. And that is
language-dependent.

What you normally do is collect the text in a (unicode) string variable.
This happens in handle_data, handle_charref and handle_entityref.
Then you check that the previously collected stuff was a word (e.g.
consisting of Unicode letters), and that the new stuff also consists of
letters. If your language has additional word constituents like - or '
you have to add this.

You can do this with unicodedata.category or with a regular
expression. If your locale is correct, \w in a regular expression may be
helpful.
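
A minimal sketch of this collecting approach, assuming Python 3 (html.parser, html.entities, chr instead of unichr). The names TextCollector and is_word, and the choice of - and ' as extra word constituents, are illustrative assumptions rather than anything prescribed above. In Python 3, \w on str patterns is Unicode-aware by default, so no locale setup is needed there.

```python
import re
import unicodedata
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that gathers all text pieces into one string."""

    def __init__(self):
        # Keep handle_charref/handle_entityref active (Python 3 would
        # otherwise decode references before calling handle_data).
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_charref(self, name):
        if name.startswith(("x", "X")):
            self.pieces.append(chr(int(name[1:], 16)))
        else:
            self.pieces.append(chr(int(name, 10)))

    def handle_entityref(self, name):
        self.pieces.append(chr(name2codepoint[name]))

    def text(self):
        return "".join(self.pieces)


def is_word(s, extra="-'"):
    """True if s has only Unicode letters or the extra constituents."""
    return bool(s) and all(
        unicodedata.category(c).startswith("L") or c in extra for c in s
    )


collector = TextCollector()
collector.feed("<p>ab&#xe9;cd efg&eacute;h, 42</p>")
collector.close()
text = collector.text()

# Regular-expression variant: runs of letters (\w minus digits and "_"),
# optionally joined by "-" or "'".
words = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text)
```

Because the pieces are joined before any word splitting happens, "ab", "é" and "cd" end up in one string and the question of whether "cd" continues the word is answered by the surrounding characters alone.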
 
Stefan Behnel

Fafounet said:
I am parsing a web page with special chars such as &#xe9; (which
stands for é).
I know I can have the unicode character é from unicode
("\xe9","iso-8859-1")
but with those extra characters I don't know.

I tried to implement handle_charref within HTMLParser without success.
Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
handle_charref will get xe9 and then handle_data doesn't have the end
of the string ("cd").

Any reason you can't use a tree based HTML parser like the one in
lxml.html? That would eliminate this kind of problem altogether, as you'd
always get a well-decoded unicode string from the tree content.
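
For illustration, a very small sketch of what that could look like, assuming the third-party lxml package is installed and using Python 3 syntax. The sample markup is made up; fromstring and text_content are the lxml.html calls used here.

```python
import lxml.html  # third-party package, installed separately

# Parse a fragment; character references are decoded by the parser.
fragment = lxml.html.fromstring("<p>ab&#xe9;cd</p>")

# text_content() returns the element's text as one decoded string.
text = fragment.text_content()
```

Here the word arrives in one piece, so there is nothing to stitch back together.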

Stefan
 
