encoding in lxml

J

jasiu85

Hey,

I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:

html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

xpath_nodes = html_doc('/html/body/something')

Now I'm guaranteed that the xpath_nodes list contains only one
element. So I read it's content:

xpath_nodes[0].text

And I get exception here. The exception is coming from the text
property of an Element object. The problem is that the text contains a
non-utf8 character. LXML seems to be using strict decoding and I can't
find a way to make it ignore the error. Is there anything I can do to
retrieve the text without getting an exception?

Regards,

Mike
 
P

pjacobi.de

Hi Mike,
I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not.

There will be host of more lightweight solutions, but you can opt
to sanizite incominhg HTML with HTML Tidy (python binding available).

It will replace invalid UTF-8 bytes with U+FFFD. It will not
guess a better encoding to use.

If you are sure you don't have HTML sloppiness to correct but only
the
occasional wrong byte, even decoding (with fallback) and encoding
using
the standard codec package will do.

Regards,
Peter
 
S

Stefan Behnel

jasiu85 said:
I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not.

You can instantiate your own HTML parser and pass encoding="utf-8". That way,
when it's not UTF-8, you will get an exception at parse time, which allows you
to reparse the document with another encoding (say, ISO-8859-1) to get the
correct content.

Stefan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,981
Messages
2,570,187
Members
46,730
Latest member
AudryNolan

Latest Threads

Top