jasiu85
Hey,
I have a problem with character encoding in LXML. Here's how it goes:
I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:
html_doc = HTML(string_with_document)
Then I retrieve some info from the document with XPath:
xpath_nodes = html_doc.xpath('/html/body/something')
Now I'm guaranteed that the xpath_nodes list contains only one
element. So I read its content:
xpath_nodes[0].text
And I get an exception here. The exception is coming from the text
property of an Element object. The problem is that the text contains a
byte sequence that is not valid UTF-8. LXML seems to be using strict decoding and I can't
find a way to make it ignore the error. Is there anything I can do to
retrieve the text without getting an exception?
Regards,
Mike
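
One possible workaround, sketched here under assumptions rather than as
a confirmed LXML feature: decode the raw bytes yourself with a lenient
error handler and hand the resulting string to the parser, so that no
invalid byte sequence ever reaches the text property. The sketch assumes
Python 3, that the document is available as bytes, and that lxml is
installed. The names decode_leniently and first_text are hypothetical
helpers made up for this illustration.

```python
def decode_leniently(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing invalid sequences with U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


def first_text(raw: bytes, xpath: str):
    """Parse leniently decoded HTML and return the text of the first XPath match.

    Hypothetical helper: assumes lxml is installed and that the decoded
    markup does not carry an XML encoding declaration.
    """
    from lxml import html  # imported lazily so the decoder works without lxml

    doc = html.fromstring(decode_leniently(raw))
    nodes = doc.xpath(xpath)
    return nodes[0].text if nodes else None
```

Using errors="ignore" instead of "replace" would drop the bad bytes
silently; "replace" keeps a visible marker where the damage was, which
makes broken pages easier to spot later.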