Replace illegal XML characters

killkolor

Hi!

I am working with XML exported from InDesign and parsing it in a Python
application. I learned here: http://boodebr.org/main/python/all-about-python-and-unicode
that some Unicode characters are actually illegal in XML
(and hence rejected by every compliant XML parser). I have already implemented
a regex solution to replace the characters in question, but I wonder
if there is an efficient, out-of-the-box solution somewhere out
there for this problem. Does anybody know?

thanks!
gabriel
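
For reference, a regex approach of the kind described above might look like the following minimal sketch. It is not the poster's actual code: the function name `replace_illegal_xml_chars` and the `?` replacement are assumptions for illustration. The character class is the complement of the `Char` production in the XML 1.0 specification, and the `\u`/`\U` escapes inside the pattern assume Python 3.3 or later.

```python
import re

# Everything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_illegal_xml_chars(text, replacement="?"):
    """Return text with every character that XML 1.0 forbids replaced.

    Hypothetical helper for illustration; pass replacement="" to drop
    the offending characters instead of substituting them.
    """
    return _ILLEGAL_XML_CHARS.sub(replacement, text)
```

With this sketch, `replace_illegal_xml_chars("a\x0bb")` returns `"a?b"`, while tabs, newlines, and ordinary non-ASCII text pass through unchanged. Note that it only handles raw characters in the text, not numeric character references such as `&#11;`.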
 
Marc 'BlackJack' Rintsch

killkolor said:
I am working with XML exported from InDesign and parsing it in a Python
application. I learned here: http://boodebr.org/main/python/all-about-python-and-unicode
that some Unicode characters are actually illegal in XML
(and hence rejected by every compliant XML parser). I have already implemented
a regex solution to replace the characters in question, but I wonder
if there is an efficient, out-of-the-box solution somewhere out
there for this problem. Does anybody know?

Does InDesign export broken XML documents? What exactly is your problem?

Ciao,
Marc 'BlackJack' Rintsch
 
killkolor

Marc 'BlackJack' Rintsch said:
Does InDesign export broken XML documents? What exactly is your problem?

Yes, unfortunately it does. It uses all possible Unicode characters,
though not all of them are allowed in valid XML (see the link in the first post).
In any case, my application should check whether the XML that
comes in is valid and replace all invalid characters. Is there
something out there to do this?
 
kyosohma

killkolor said:
Yes, unfortunately it does. It uses all possible Unicode characters,
though not all of them are allowed in valid XML (see the link in the first post).
In any case, my application should check whether the XML that
comes in is valid and replace all invalid characters. Is there
something out there to do this?

You might be able to use "Beautiful Soup":

http://www.crummy.com/software/BeautifulSoup/

There are also some good examples for parsing XML at
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
and on the Dive Into Python site.


Mike
 
Diez B. Roggisch

killkolor said:
Yes, unfortunately it does. It uses all possible Unicode characters,
though not all of them are allowed in valid XML (see the link in the first post).
In any case, my application should check whether the XML that
comes in is valid and replace all invalid characters. Is there
something out there to do this?

I doubt it. Dealing with broken XML is not something standard modules should
have to cope with. The link you provided has all you need, so why not just use it?


Diez
 
Irmen de Jong

killkolor said:
Yes, unfortunately it does. It uses all possible Unicode characters,
though not all of them are allowed in valid XML (see the link in the first post).

Are you sure about this? Could you post a small example?

If this is true, don't forget to file a bug report with Adobe too.

--Irmen
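
A small example of the kind asked for here could be built along these lines. This is only a sketch, assuming the problem character is a raw control character such as a vertical tab (U+000B); the sample strings and the helper name `is_well_formed` are made up for illustration and do not come from an actual InDesign export. It uses the standard library's `xml.etree.ElementTree`, whose parser raises `ParseError` on input that is not well-formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Made-up samples: the first contains a raw vertical tab (U+000B),
# which XML 1.0 does not allow anywhere in a document.
sample_bad = "<root>vertical tab: \x0b</root>"
sample_good = "<root>plain text</root>"

print(is_well_formed(sample_bad))
print(is_well_formed(sample_good))
```

Running a check like this on a short excerpt of the exported file would show whether the export really contains characters that a compliant parser rejects, which is also what a bug report would need.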
 
