ElementTree and Unicode

  • Thread starter Sébastien Boisgérault

Sébastien Boisgérault

I guess I am doing something wrong... Any clue?
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 960, in XML
    parser.feed(text)
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 15

Cheers,

SB
 

Richard Brodie

I'm not as familiar with elementtree.ElementTree as I perhaps
should be. However, you appear to be trying to insert a null
character into an XML document. Should you succeed in this
quest, the resulting document will be ill-formed, and any
conforming parser will choke on it.
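
A minimal sketch of the failure described above, assuming Python 3 and the standard-library xml.etree.ElementTree rather than the standalone elementtree package on Python 2.4 used in the thread. The helper name parses and the sample documents are hypothetical; the code that produced the original traceback is not shown in the thread.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the expat-based parser accepts `document`."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # expat reports characters outside the XML Char production
        # as "not well-formed (invalid token)".
        return False
    return True


# Ordinary text is accepted.
print(parses("<text>abc</text>"))
# A NUL character (U+0000) in element content is not a legal XML character.
print(parses("<text>a\x00c</text>"))
```

The same helper can be used to probe other control characters, such as U+0001, before deciding how to store a string.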
 

Sébastien Boisgérault

Richard said:
I'm not as familiar with elementtree.ElementTree as I perhaps
should be. However, you appear to be trying to insert a null
character into an XML document. Should you succeed in this
quest, the resulting document will be ill-formed, and any
conforming parser will choke on it.

I am trying to embed *arbitrary* (Unicode) strings inside
an XML document. Of course I'd like to be able to reconstruct
them later from the XML document... If the naive way to do it does
not work, can anyone suggest a way to do it?

SB
 

Marc 'BlackJack' Rintsch

Sébastien said:
I am trying to embed *arbitrary* (Unicode) strings inside
an XML document. Of course I'd like to be able to reconstruct
them later from the XML document... If the naive way to do it does
not work, can anyone suggest a way to do it?

Encode it in UTF-8 and then in Base64. AFAIK that is the only reliable way
to put an arbitrary string into XML and get exactly the same string back again.
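
The UTF-8 plus Base64 round trip could be sketched as follows, assuming Python 3 and the standard-library xml.etree.ElementTree. The function names embed and extract and the element name data are hypothetical choices for illustration, not part of any library.

```python
import base64
import xml.etree.ElementTree as ET


def embed(text):
    """Serialize `text` as a <data> element holding Base64 of its UTF-8 bytes."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    elem = ET.Element("data")  # hypothetical element name
    elem.text = payload
    return ET.tostring(elem, encoding="unicode")


def extract(xml_doc):
    """Reverse embed(): parse the document, then decode Base64 and UTF-8."""
    elem = ET.fromstring(xml_doc)
    # An empty payload is serialized as an empty element, so text may be None.
    return base64.b64decode(elem.text or "").decode("utf-8")


original = "a\x00b\x01 caf\u00e9"
assert extract(embed(original)) == original
```

Base64 output uses only ASCII letters, digits, "+", "/" and "=", all of which are legal XML characters, so the parser never sees the control characters. One caveat for this sketch: a Python 3 string containing a lone surrogate cannot be encoded as UTF-8 and would raise UnicodeEncodeError in embed.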

Ciao,
Marc 'BlackJack' Rintsch
 

Martin v. Löwis

Sébastien Boisgérault said:
I am trying to embed *arbitrary* (Unicode) strings inside
an XML document. Of course I'd like to be able to reconstruct
them later from the XML document... If the naive way to do it does
not work, can anyone suggest a way to do it?

XML does not support arbitrary Unicode characters; a few control
characters are excluded. See the definition of Char in

http://www.w3.org/TR/2004/REC-xml-20040204

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]
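
As an illustration, the Char production above can be transcribed into a small Python 3 predicate. The names is_xml_char and first_illegal_char are hypothetical helpers, not part of ElementTree or any other library.

```python
def is_xml_char(ch):
    """Return True if the single character `ch` matches production [2] Char."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(text):
    """Return (index, char) for the first character outside Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index, ch
    return None
```

A check like first_illegal_char(text) is None can be run before building a document, to decide whether a string may be stored directly or needs an encoding step.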

Now, one might think you could use a character reference
(e.g. ) to refer to the "missing" characters, but this is not so:

[66] CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'

Well-formedness constraint: Legal Character
Characters referred to using character references must match the
production for Char.
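
The Legal Character constraint can be observed with a short Python 3 sketch using the standard-library xml.etree.ElementTree (an assumption; the thread used the standalone elementtree package). The helper name charref_accepted and the element name text are hypothetical.

```python
import xml.etree.ElementTree as ET


def charref_accepted(code_point):
    """Parse a document whose content is one hexadecimal character reference.

    Returns True only if the parser accepts it and yields that character.
    """
    document = "<text>&#x%X;</text>" % code_point
    try:
        return ET.fromstring(document).text == chr(code_point)
    except ET.ParseError:
        # The parser rejects references to characters outside Char.
        return False


print(charref_accepted(0x9))  # tab is listed in the Char production
print(charref_accepted(0x1))  # U+0001 is not
```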

As others have explained, if you want to transmit arbitrary characters,
you need to encode them as text in some way. One obvious solution
would be to encode the Unicode data as UTF-8 first, and then encode
the UTF-8 bytes using base64. The receiver of the XML document then
must do the reverse.

Regards,
Martin
 

Sébastien Boisgérault

Martin said:
Sébastien Boisgérault said:
I am trying to embed *arbitrary* (Unicode) strings inside
an XML document. Of course I'd like to be able to reconstruct
them later from the XML document... If the naive way to do it does
not work, can anyone suggest a way to do it?

XML does not support arbitrary Unicode characters; a few control
characters are excluded. See the definition of Char in

http://www.w3.org/TR/2004/REC-xml-20040204

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]

Now, one might think you could use a character reference
(e.g. ) to refer to the "missing" characters, but this is not so:

[66] CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'

Well-formedness constraint: Legal Character
Characters referred to using character references must match the
production for Char.

As others have explained, if you want to transmit arbitrary characters,
you need to encode them as text in some way. One obvious solution
would be to encode the Unicode data as UTF-8 first, and then encode
the UTF-8 bytes using base64. The receiver of the XML document then
must do the reverse.

Regards,
Martin

OK! Thanks a lot for this helpful information.

Cheers,

SB
 
