xml.dom.minidom: how to preserve CRLFs inside CDATA?

sim.sim

Hi all.
I've run into trouble using minidom:

# I have a string (XML) with a CDATA section, and the section includes "\r\n":
iInStr = '<?xml version="1.0"?>\n<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n'

# After I create the DOM object, I get the value of "Data" with "\r\n" turned into "\n":

from xml.dom import minidom
iDoc = minidom.parseString(iInStr)
iDoc.childNodes[0].childNodes[0].data # it gives u'BEGIN:VCALENDAR\nEND:VCALENDAR\n'

According to http://www.w3.org/TR/REC-xml/#sec-line-ends this looks normal,
but another part of the specification says that "only the CDEnd string is
recognized as markup": http://www.w3.org/TR/REC-xml/#sec-cdata-sect

So the parser must (IMHO) return the value of the CDATA section "as is"
(so that the two parts of the document do not contradict each other).

How can I get the value of the CDATA section with all characters inside it
preserved? (Perhaps with another parser; which one?)

Many thanks for any help.
 
kyosohma

I'm thinking that the endline character "\n" is relevant for *nix
systems. So if you're running this on Windows, Python will translate
it automatically to "\r\n". According to Lutz's book, Programming
Python 3rd Ed, it's for historical reasons. It says that most text
editors handle text in Unix format, with the exception of Notepad,
which is why some documents are displayed as just one long line in
Notepad. (see pg 150 of said book).

The book goes on to talk about how to use a script that will check
this endline character and fix it depending on the platform you're
running under. The following link seems to do something along those
lines as well.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/435882

Not exactly helpful, but maybe it'll give you some insight into the
issue.

Mike
 
harvey.thomas

You will lose the \r characters. From the document you referred to
"""
This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters,
carriage returns, line feeds, or tabs.

White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+

Note:

The presence of #xD in the above production is maintained purely for
backward compatibility with the First Edition. As explained in 2.11
End-of-Line Handling, all #xD characters literally present in an XML
document are either removed or replaced by #xA characters before any
other processing is done. The only way to get a #xD character to match
this production is to use a character reference in an entity value
literal.
"""
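
A minimal sketch of the behaviour this passage describes, assuming Python 3 (the thread itself uses Python 2). The helper name `element_text` is made up for illustration. It shows a literal CR LF inside CDATA being normalized, a `&#13;` character reference outside CDATA surviving as a real CR, and one possible way to restore CRLF afterwards when the payload is known to use it.

```python
from xml.dom import minidom


def element_text(xml_source):
    """Return the concatenated character data of the document element."""
    doc = minidom.parseString(xml_source)
    return "".join(node.data for node in doc.documentElement.childNodes)


# A literal CR LF inside CDATA is normalized to a single LF before the
# application ever sees it.
cdata_doc = '<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>'
normalized = element_text(cdata_doc)

# Outside CDATA, a character reference (&#13;) is not a literal #xD in the
# document, so it is delivered to the application as a real CR.
charref_doc = '<Data>BEGIN:VCALENDAR&#13;\nEND:VCALENDAR&#13;\n</Data>'
preserved = element_text(charref_doc)

# If the payload format is known to require CRLF (as iCalendar does),
# the line ends can be restored after parsing.
restored = normalized.replace("\n", "\r\n")

print(repr(normalized))
print(repr(preserved))
print(repr(restored))
```

Character references are not recognized inside a CDATA section, so the `&#13;` approach only works if the sender emits ordinary escaped text instead of CDATA.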
 
sim.sim

Hi all, I have another problem with minidom, and now it is really
critical.

Below is code that tries to parse well-formed XML, but it fails
with the error message:
"not well-formed (invalid token): line 3, column 85"


from xml.dom import minidom

iMessage = "3c3f786d6c2076657273696f6e3d22312e30223f3e0a3c6d657373616\
7653e0a202020203c446174613e3c215b43444154415bd094d0b0d0bdd0bdd18bd0b5\
20d0bfd0bed0bfd183d0bbd18fd180d0bdd18bd18520d0b7d0b0d0bfd180d0bed181d\
0bed0b220d0bcd0bed0b6d0bdd0be20d183d187d0b8d182d18bd0b2d0b0d182d18c20\
d0bfd180d0b820d181d0bed0b1d181d182d0b2d0b5d0bdd0bdd18bd18520d180d0b5d\
0bad0bbd0b0d0bcd0bdd15d5d3e3c2f446174613e0a3c2f6d6573736167653e0a0a".\
decode('hex')

iMsgDom = minidom.parseString(iMessage)


The "problem" is within the CDATA section: it contains part of a UTF-8
encoded string that was split (this is widely used for memory-limited
devices).

When minidom parses the XML string, it fails because it tries to convert
the data within the CDATA section into unicode, instead of just returning
the value of the section "as is". The conversion contradicts the
specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect


So my question is still open:

How can I get the value of the CDATA section with all characters inside it
preserved? (Perhaps with another parser; which one?)

Thanks for your help.

Maksim
 
Marc 'BlackJack' Rintsch

Below is code that tries to parse well-formed XML, but it fails
with the error message:
"not well-formed (invalid token): line 3, column 85"

How did you verify that it is well formed? `xmllint` barfs on it too.

The "problem" is within the CDATA section: it contains part of a UTF-8
encoded string that was split (this is widely used for memory-limited
devices).

When minidom parses the XML string, it fails because it tries to convert
the data within the CDATA section into unicode, instead of just returning
the value of the section "as is". The conversion contradicts the
specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect

An XML document contains unicode characters, and so does the CDATA section.
CDATA is not meant to put arbitrary bytes into a document. It must
contain valid characters of this type
http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA in
your link above).

Ciao,
Marc 'BlackJack' Rintsch
 
sim.sim

How did you verify that it is well formed? `xmllint` barfs on it too.

You can try writing iMessage to a file and opening it in Mozilla
Firefox (web browser).

An XML document contains unicode characters, and so does the CDATA section.
CDATA is not meant to put arbitrary bytes into a document. It must
contain valid characters of this type
http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA in
your link above).

Ciao,
Marc 'BlackJack' Rintsch


My CDATA section contains only characters in the range specified for
Char:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]


filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)
 
Carsten Haese

my CDATA-section contains only symbols in the range specified for
Char:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]


filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)

That test is meaningless. The specified range is for unicode characters,
and your iMessage is a byte string, presumably utf-8 encoded unicode.

Let's try decoding it:

>>> iMessage = "3c3f786d6c2076657273696f6e3d22312e30223f3e0a3c6d657373616\
... 7653e0a202020203c446174613e3c215b43444154415bd094d0b0d0bdd0bdd18bd0b5\
... 20d0bfd0bed0bfd183d0bbd18fd180d0bdd18bd18520d0b7d0b0d0bfd180d0bed181d\
... 0bed0b220d0bcd0bed0b6d0bdd0be20d183d187d0b8d182d18bd0b2d0b0d182d18c20\
... d0bfd180d0b820d181d0bed0b1d181d182d0b2d0b5d0bdd0bdd18bd18520d180d0b5d\
... 0bad0bbd0b0d0bcd0bdd15d5d3e3c2f446174613e0a3c2f6d6573736167653e0a0a".\
... decode('hex')
>>> iMessage.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 176-177: invalid data
>>> iMessage[176:178]
'\xd1]'

And that's your problem. In general you can't just truncate a utf-8
encoded string anywhere and expect the result to be valid utf-8. The
\xd1 at the very end of your CDATA section is the first byte of a
two-byte sequence that represents some unicode code-point between \u0440
and \u047f, but it's missing the second byte that says which one.

Whatever you're using to generate this data needs to be smarter about
splitting the unicode string. Rather than encoding and then splitting,
it needs to split first and then encode, or take some other measures to
make sure that it doesn't leave incomplete multibyte sequences at the
end.
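
One way to "take some other measures" on the sending side could look like the sketch below. It assumes Python 3, and the function name `split_utf8` and the chunk size are invented for illustration. It encodes first, then moves each cut point backwards until it no longer falls on a UTF-8 continuation byte, so every chunk decodes on its own.

```python
def split_utf8(text, max_bytes):
    """Yield UTF-8 chunks of at most max_bytes bytes without cutting a character."""
    if max_bytes < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")
    data = text.encode("utf-8")
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Continuation bytes look like 10xxxxxx. If the byte right after the
        # cut is one, the cut is inside a character, so back up to its lead byte.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        yield data[start:end]
        start = end


text = "Данные популярных запросов"
chunks = list(split_utf8(text, 5))

# Every chunk is valid UTF-8 by itself, and together they give the original.
decoded = [chunk.decode("utf-8") for chunk in chunks]
print(decoded)
```

This only guards against splitting a code point; it does not try to keep combining sequences together.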

HTH,
 
harvey.thomas

You need to explicitly convert the string of UTF8 encoded bytes to a
Unicode string before parsing e.g.
unicodestring = unicode(encodedbytes, 'utf8')

Unless I messed up copying and pasting, your original string had an
erroneous byte immediately before ]]>. With that corrected I was able
to process the string correctly: the CDATA marked section consists
entirely of spaces and Cyrillic characters. As I noted earlier, you
will lose \r characters as part of the basic XML processing.

HTH

Harvey
 
Richard Brodie

How did you verify that it is well formed?

It appears to have a more fundamental problem, which is
that it isn't correctly encoded (presumably because the
CDATA is truncated in mid-character). I'm surprised
Mozilla lets it slip by.
 
Neil Cerutti

It appears to have a more fundamental problem, which is
that it isn't correctly encoded (presumably because the
CDATA is truncated in mid-character). I'm surprised
Mozilla lets it slip by.

Web browsers are in the very business of reasonably rendering
ill-formed mark-up. It's one of the things that makes
implementing a browser take forever. ;)
 
Richard Brodie

Web browsers are in the very business of reasonably rendering
ill-formed mark-up. It's one of the things that makes
implementing a browser take forever. ;)

For HTML, yes. It accepts all sorts of garbage, like most
browsers; I've never, before now, seen it accept an invalid
XML document though.
 
Mattia Gentilini

Richard Brodie wrote:
For HTML, yes. It accepts all sorts of garbage, like most
browsers; I've never, before now, seen it accept an invalid
XML document though.

It *could* depend on Content-Type. I've seen that Firefox treats XHTML
as HTML (i.e. it does not try to validate it) if you set Content-Type to
text/html. However, the same document with Content-Type
application/xhtml+xml is checked for well-formedness (if the DOM
inspector is installed). So Firefox probably treats that badly encoded
document as text/html (maybe as a failsafe setting), which could explain
why it accepts it.
 
Maksim Kasimov

Carsten Haese:
'utf8' codec can't decode bytes in position 176-177: invalid data
'\xd1]'

And that's your problem. In general you can't just truncate a utf-8
encoded string anywhere and expect the result to be valid utf-8. The
\xd1 at the very end of your CDATA section is the first byte of a
two-byte sequence that represents some unicode code-point between \u0440
and \u047f, but it's missing the second byte that says which one.


In my previous message I already explained that this situation is common with
memory-limited devices, such as mobile terminals from Nokia, Sony Ericsson, Siemens and so on.

And I pointed out to you that it is part of a split string.

Split content is a _standard_ in the mobile world, well described at http://www.openmobilealliance.org,
and it does _not_ contradict the XML spec.


The problem is that pyexpat works _improperly_.
 
Maksim Kasimov

harvey.thomas:
You need to explicitly convert the string of UTF8 encoded bytes to a
Unicode string before parsing e.g.
unicodestring = unicode(encodedbytes, 'utf8')


It is only a part of a string, not the whole string; I wrote that before.
That means the content cannot be converted to unicode until the receiving
program has got all parts of the UTF-8 string from the sender.

The XML in iMessage is absolutely correct. Please read my previous posts.

Thanks.
 
Maksim Kasimov

Richard Brodie wrote:
For HTML, yes. it accepts all sorts of garbage, like most
browsers; I've never, before now, seen it accept an invalid
XML document though.


I don't think it is constructive to discuss whether Mozilla works correctly
instead of pointing out a contradiction in my message, is it?


Try to browse any file with garbage in it and an "xml" extension.
If you do, you will see an error message from the XML parser.


I insist: my message is correct and does not contradict any point of the w3.org XML specification.
 
Jarek Zgoda

Maksim Kasimov wrote:
'utf8' codec can't decode bytes in position 176-177: invalid data
iMessage[176:178]
'\xd1]'

And that's your problem. In general you can't just truncate a utf-8
encoded string anywhere and expect the result to be valid utf-8. The
\xd1 at the very end of your CDATA section is the first byte of a
two-byte sequence that represents some unicode code-point between \u0440
and \u047f, but it's missing the second byte that says which one.


In my previous message I already explained that this situation is common
with memory-limited devices, such as mobile terminals from Nokia, Sony Ericsson,
Siemens and so on.

And I pointed out to you that it is part of a split string.

No, it is not part of a string. It's part of a byte stream, split in the
middle of a multibyte-encoded character.

You can't take only the dot from a small letter "i" and ask the parser to
treat it as a complete "i".
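
On the receiving side, a byte stream that was cut in the middle of a character can still be decoded once the fragments are fed through an incremental decoder, which holds back an incomplete trailing sequence until the next fragment arrives. The sketch below assumes Python 3 and assumes the raw bytes of each fragment have already been obtained somehow; `join_fragments` is a made-up helper name.

```python
import codecs


def join_fragments(fragments):
    """Decode UTF-8 byte fragments that may be cut in the middle of a character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for fragment in fragments:
        # An incomplete trailing sequence is buffered, not reported as an error.
        pieces.append(decoder.decode(fragment))
    # final=True raises UnicodeDecodeError if a character is still incomplete.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


payload = "Данные".encode("utf-8")
# Cut after 5 bytes: the first fragment ends with a lone lead byte.
fragments = [payload[:5], payload[5:]]
text = join_fragments(fragments)
print(text)
```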
 
Maksim Kasimov

Jarek Zgoda:
No, it is not part of a string. It's part of a byte stream, split in the
middle of a multibyte-encoded character.

You can't take only the dot from a small letter "i" and ask the parser to
treat it as a complete "i".

... I know :))
Can you propose something to solve it? ;)
 
Carsten Haese

I insist: my message is correct and does not contradict any point of the w3.org XML specification.

The fact that you believe this so strongly and we disagree just as
strongly indicates a fundamental misunderstanding. Your fundamental
misunderstanding is about the difference between bytes and unicode code points.

The content of an XML document is a sequence of unicode code points,
encoded into a sequence of bytes using some character encoding. The
<?xml...?> header should identify that encoding. In the absence of an
explicit encoding specification, the parser will guess what encoding the
content uses. In your case, the encoding is absent, and the parser
guesses utf-8, but your string is not a valid utf-8 string.

If you want to convey an arbitrary sequence of bytes as if they were
characters, you need to pick a character encoding that can handle an
arbitrary sequence of bytes. utf-8 can not do that. ISO-8859-1 can, but
you need to specify the encoding explicitly. Observe what happens if I
take your example and insert an encoding specification:
>>> iMessage = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<message>\n
<Data><![CDATA[\xd0\x94\xd0\xb0\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xb5 \xd0\xbf
\xd0\xbe\xd0\xbf\xd1\x83\xd0\xbb\xd1\x8f\xd1\x80\xd0\xbd\xd1\x8b\xd1\x85
\xd0\xb7\xd0\xb0\xd0\xbf\xd1\x80\xd0\xbe\xd1\x81\xd0\xbe\xd0\xb2 \xd0
\xbc\xd0\xbe\xd0\xb6\xd0\xbd\xd0\xbe \xd1\x83\xd1\x87\xd0\xb8\xd1\x82
\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbf\xd1\x80\xd0\xb8 \xd1
\x81\xd0\xbe\xd0\xb1\xd1\x81\xd1\x82\xd0\xb2\xd0\xb5\xd0\xbd\xd0\xbd\xd1
\x8b\xd1\x85 \xd1\x80\xd0\xb5\xd0\xba\xd0\xbb\xd0\xb0\xd0\xbc\xd0\xbd
\xd1]]></Data>\n</message>\n\n'
>>> minidom.parseString(iMessage)
<xml.dom.minidom.Document instance at 0xb7c157ac>

Of course, when you extract your CDATA, it will come out as a unicode
string which you'll have to encode with ISO-8859-1 to turn it into a
sequence of bytes. Then you add the sequence of bytes from the next
message, and in the end that should yield a valid utf-8-encoded string
once you've collected and assembled all fragments.
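
The round trip described here might be sketched as follows, assuming Python 3. The message layout and the helper names `wrap_fragment` and `extract_fragment` are invented for the example and do not come from any mobile standard. Each fragment is declared as ISO-8859-1 so that every byte maps to one character, the CDATA text is encoded back to bytes with the same encoding, and the concatenated bytes are decoded as UTF-8 only at the end.

```python
from xml.dom import minidom


def wrap_fragment(raw_bytes):
    """Build a message that declares ISO-8859-1 (hypothetical layout)."""
    return (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            b'<message><Data><![CDATA[' + raw_bytes + b']]></Data></message>')


def extract_fragment(message):
    """Parse one message and return the CDATA payload as the original bytes."""
    doc = minidom.parseString(message)
    data = doc.getElementsByTagName("Data")[0]
    text = "".join(node.data for node in data.childNodes)
    # Each character is U+0000..U+00FF, so this gives back the raw bytes.
    return text.encode("iso-8859-1")


payload = "Данные".encode("utf-8")
# Cut in the middle of a two-byte character, as in the thread.
messages = [wrap_fragment(payload[:5]), wrap_fragment(payload[5:])]

reassembled = b"".join(extract_fragment(m) for m in messages)
text = reassembled.decode("utf-8")
print(text)
```

The trick still has limits: a literal 0x0D byte in the payload is normalized to a line feed like any other carriage return, most control bytes below 0x20 are not legal XML characters, and the byte sequence `]]>` would end the CDATA section early.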

Hope this helps,
 
Maksim Kasimov

Hi Carsten! Thanks for your suggestion; the problem can be fixed that way.


BTW: I found "xmlproc" and tried parsing with its command-line tool xpcmd.py.
It gives me
"Parse complete, 0 error(s) and 0 warning(s)"

I did not specify the "ISO-8859-1" character encoding.

(But switching to that library is another problem: I would have to recode/retest/redoc/re* a lot of things.)

The project homepage: http://www.garshol.priv.no/download/software/xmlproc/


And another thing: I opened my XML message in Mozilla again and
selected the "Page info" item in the pop-up menu. It shows me:
Content-Type: text/xml
Encoding: UTF-8


Many thanks for your attention and patience!
 
