not quite 1252

Serge Orlov · Apr 28, 2006

Anton said:
Or for example in firefox:

<text:s/>
in Amsterdam
<text:s/>

So, probably yes. If it doesn't have a text attribute when you iterate
over it using OOoPy, for example:

o = OOoPy (infile = fname)
c = o.read ('content.xml')
for x in c.getiterator():
    if x.text:

Then we know for sure you have recreated my other problem.

I'm tweaking a small test file and see that
<text:s/> is one space character
<text:s text:c="2"/> is two space characters
<text:s text:c="3"/> is three space characters
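
A rough sketch of that rule, in case it helps anyone flattening such a file to plain text. This is not OOoPy code: `expand_spaces` and `flatten` are made-up helper names, and the namespace URI is the OpenDocument one, which is an assumption here (older .sxw files declare a different URI, so pass the right one in).

```python
import xml.etree.ElementTree as ET

# Assumption: OpenDocument text namespace; .sxw files use another URI.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def expand_spaces(elem, text_ns=TEXT_NS):
    """Return the run of spaces that one <text:s> element stands for."""
    # A missing text:c attribute means a single space.
    count = int(elem.get("{%s}c" % text_ns, "1"))
    return " " * count

def flatten(elem, text_ns=TEXT_NS):
    """Concatenate all text under elem, expanding <text:s> elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "{%s}s" % text_ns:
            parts.append(expand_spaces(child, text_ns))
        else:
            parts.append(flatten(child, text_ns))
        # Text that follows the child element lives in its tail.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical sample paragraph, not taken from a real document.
sample = (
    '<text:p xmlns:text="%s">'
    'in<text:s/>Amsterdam<text:s text:c="3"/>now'
    '</text:p>' % TEXT_NS
)
root = ET.fromstring(sample)
flat = flatten(root)
```

The point is that a `<text:s/>` element carries no `.text` of its own, so code that only looks at `x.text` will silently drop those spaces.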

Anton Vredegoor · Apr 28, 2006

Martin said:
So if that is the case: What is the problem then? If you interpret
the document as cp1252, and it contains \x93 and \x94, what is
it that you don't like about that? In yet other words: what actions
are you performing, what are the results you expect to get, and
what are the results that you actually get?

Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.

I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it.

Anton

Martin v. Löwis · Apr 29, 2006

Anton said:
Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.

Ah. Then the document is most likely right: \x94 can very well occur
in a UTF-8 file.
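
A small illustration of that point (a sketch with a made-up sample string, not data from the file in question): the byte 0x94 is a closing curly quote in cp1252, but in UTF-8 it also turns up as the second byte of a two-byte character such as U+00D4.

```python
# Hypothetical sample text: curly quotes plus U+00D4 (O with circumflex).
text = u"\u201cquoted\u201d \u00d4"

utf8 = text.encode("utf-8")     # multi-byte sequences for the non-ASCII characters
cp1252 = text.encode("cp1252")  # one byte per character

# In cp1252 the quotes are the single bytes 0x93 and 0x94.
quote_bytes = cp1252[:1] + cp1252[-3:-2]

# In UTF-8 the byte 0x94 still occurs, as the trail byte of U+00D4 (C3 94).
has_94_in_utf8 = b"\x94" in utf8

# Decoding with the declared encoding gives the original text back.
round_trip = utf8.decode("utf-8")
```

So seeing \x93 or \x94 among the raw bytes does not by itself say the file is cp1252; what matters is whether the whole byte sequence decodes cleanly with the declared encoding.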

I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it.

Well, if the document is UTF-8, you should decode it as UTF-8, of
course.

Regards,
Martin

Anton Vredegoor · Apr 29, 2006

Martin said:
Well, if the document is UTF-8, you should decode it as UTF-8, of
course.

Thanks. This and:

http://en.wikipedia.org/wiki/UTF-8

solved my problem with understanding the encoding.

Anton

proof that I understand it now (please anyone, prove me wrong if you can):

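from zipfile import ZipFile, ZIP_DEFLATED

def by80(seq):
    # Yield successive chunks of at most 80 characters. Slicing stops
    # cleanly at the end of the data; "while it:" never became false.
    for start in range(0, len(seq), 80):
        yield seq[start:start + 80]

def utfCheck(infn):
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    # content.xml declares UTF-8, so decode it as UTF-8
    data = zin.read('content.xml').decode('utf-8')
    for line in by80(data):
        # raises UnicodeEncodeError for characters outside cp1252
        print(line.encode('1252'))

def test():
    infn = "xxx.sxw"
    utfCheck(infn)

if __name__=='__main__':
    test()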
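
For anyone who wants to try the same idea without a real .sxw at hand, here is a Python 3 sketch. It builds a stand-in archive in memory; the archive contents and the helper names `read_content` and `not_in_cp1252` are made up for illustration. Instead of printing, it lists the characters that would not survive re-encoding to cp1252.

```python
import io
from zipfile import ZipFile

def read_content(zip_source, member="content.xml"):
    """Return the named archive member decoded as UTF-8."""
    with ZipFile(zip_source, "r") as zin:
        return zin.read(member).decode("utf-8")

def not_in_cp1252(text):
    """Return a sorted list of characters that cp1252 cannot represent."""
    bad = set()
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)

# Build a small stand-in archive in memory (hypothetical content).
payload = "<p>\u201cin Amsterdam\u201d \u2192</p>"
buf = io.BytesIO()
with ZipFile(buf, "w") as zout:
    zout.writestr("content.xml", payload.encode("utf-8"))

text = read_content(buf)
unrepresentable = not_in_cp1252(text)
```

The curly quotes pass because cp1252 happens to contain them; the arrow does not, which is one reason re-encoding to cp1252 is only a spot check and not a general test of the decoding.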
