XSL and Unicode surrogate characters

Sakcee

Hi,

In one of my data files I am seeing the bytes
\xed\xa0\xa0. They seem to break the XSL processing.

---------------------------------------------------------------
Extra content at the end of the document
XML/XSL Error: </data><data ><![CDATA[ í Pls advice
----------------------------------------------------------------


This seems to break libxml2/libxslt.

Is this a Unicode UTF-16 surrogate pair?
For displaying it in XML/XSL, should I extract only \xa0?
Since this is higher than the 00-7f range, can I just strip it?
Under what conditions would the encoding software put this string in?


Thanks for the help.
 

Martin v. Löwis

Sakcee said:
Hi,

In one of my data files I am seeing the bytes
\xed\xa0\xa0. They seem to break the XSL processing. [...]
Is this a Unicode UTF-16 surrogate pair?

Yes and no. This is the UTF-8 encoding of U+D820, which is a high
surrogate code point. So yes. It's not yet a pair; there would have to
be a second such code point. So no.

Furthermore, UTF-8 should never contain encoded surrogate code
points; instead, whoever generated the UTF-8 should have combined the
two surrogate code points into a single coded character, and should
have encoded *that* character. So no - this byte sequence isn't
even valid UTF-8.
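
The arithmetic behind that answer can be checked by hand. The following is only a minimal sketch, assuming Python 3; the helper name decode_three_byte_utf8 is made up for illustration. It applies the 3-byte UTF-8 bit layout to the bytes from the question, then shows that Python's strict UTF-8 decoder refuses the same bytes.

```python
def decode_three_byte_utf8(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) by hand.

    Hypothetical helper for illustration; it does no validation.
    """
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte_utf8(b"\xed\xa0\xa0")
print(f"U+{code_point:04X}")              # U+D820
print(0xD800 <= code_point <= 0xDBFF)     # True: inside the high-surrogate range

# A strict UTF-8 decoder does not accept the sequence at all.
try:
    b"\xed\xa0\xa0".decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)                          # False
```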

Sakcee said:
For displaying it in XML/XSL, should I extract only \xa0?

You should tell your parser to reject the file as ill-formed.
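
One way to do such a check before the data ever reaches libxml2/libxslt is a strict decode up front. This is a sketch only, assuming Python 3 and a file that is supposed to be UTF-8; the function name first_invalid_utf8_offset and the sample documents are hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper: relies on the strict error handling of bytes.decode.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Sample inputs made up for illustration.
good = "<data><![CDATA[ok]]></data>".encode("utf-8")
bad = b"<data><![CDATA[ \xed\xa0\xa0 Pls advice]]></data>"

print(first_invalid_utf8_offset(good))  # None
print(first_invalid_utf8_offset(bad))   # offset of the 0xED byte
```

A caller could refuse the file, or log the offset and send it back to whoever produced the data, whenever the result is not None.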

Sakcee said:
Since this is higher than the 00-7f range, can I just strip it?

Depending on what you want to achieve: sure! It will modify
the meaning of the bytes, of course.

Sakcee said:
Under what conditions would the encoding software put this string in?

If it has a bug.

Regards,
Martin
 
Sakcee

Thanks very much for the info, it really helped.

We are using the text from the file to display on a web page, and we have a
method that converts the parsed data to UTF-8 before displaying it. All
the data looks fine after parsing except the surrogate pair.
Since I cannot guess what it was supposed to be, is it OK to strip it
using the regex re.compile(' [\xed|\xa0] ')?



Diez B. Roggisch

Sakcee said:
Thanks very much for the info, it really helped.

We are using the text from the file to display on a web page, and we have a
method that converts the parsed data to UTF-8 before displaying it. All
the data looks fine after parsing except the surrogate pair.
Since I cannot guess what it was supposed to be, is it OK to strip it
using the regex re.compile(' [\xed|\xa0] ')?

As Martin said: that alters the meaning of the bytes. Whether that should
bother you is yours to decide. If, for example, you stripped all vowels
from a text, it might still be comprehensible to most people, so if vowels
bother you for whatever reason, remove them.

Bt myb y bttr try nd fx th prblm n th frst plc.
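
If stripping is what you settle on anyway, note that removing every \xed or \xa0 byte on its own also damages valid characters (a no-break space is \xc2\xa0 in UTF-8, for instance). A narrower approach, shown here only as a sketch assuming Python 3 and hypothetical names, is to remove just the three-byte sequences that encode surrogates: \xed, then a byte in \xa0-\xbf, then any continuation byte.

```python
import re

# Encoded surrogates U+D800..U+DFFF: ED, then A0..BF, then 80..BF.
# The pattern and function names are hypothetical, for illustration only.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def strip_encoded_surrogates(data: bytes) -> bytes:
    """Drop UTF-8-style encoded surrogates and leave all other bytes alone."""
    return ENCODED_SURROGATE.sub(b"", data)


# "café", a no-break space, then the bad bytes from the question.
raw = b"caf\xc3\xa9\xc2\xa0 \xed\xa0\xa0 Pls advice"

text = strip_encoded_surrogates(raw).decode("utf-8")
print(repr(text))

# Stripping the individual bytes instead tears the no-break space apart,
# so the result is still not valid UTF-8.
naive = re.sub(rb"[\xed\xa0]", b"", raw)
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
print(naive_ok)  # False
```

This still throws information away, so it hides the bug in whatever produced the file rather than fixing it.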

Regards,

Diez
 
