not quite 1252

Serge Orlov

Anton said:
Or for example in firefox:

<text:s/>
in Amsterdam
<text:s/>

So, probably yes. If it doesn't have a text attribute when you iterate
over it, using OOoPy for example:

o = OOoPy (infile = fname)
c = o.read ('content.xml')
for x in c.getiterator():
    if x.text:

Then we know for sure you have recreated my other problem.

I'm tweaking a small test file and see that
<text:s/> is one space character
<text:s text:c="2"/> is two space characters
<text:s text:c="3"/> is three space characters
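The space-count rule observed above can be sketched in Python 3 with the standard library's ElementTree. This is only an illustration, not OOoPy's own behaviour: the helper name `flatten` and the sample paragraph are made up, and the namespace URI is the OpenDocument one (older .sxw files use a different text namespace, so the constant would need adjusting there).

```python
import xml.etree.ElementTree as ET

# Assumption: OpenDocument text namespace; .sxw files use another URI.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def flatten(elem):
    """Return the text of elem with <text:s/> elements expanded to spaces."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "{%s}s" % TEXT_NS:
            # text:c holds the number of spaces; a missing attribute means 1.
            count = int(child.get("{%s}c" % TEXT_NS, "1"))
            parts.append(" " * count)
        else:
            parts.append(flatten(child))
        # Text that follows the child element lives in its tail.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical sample paragraph, built only for this sketch.
sample = (
    '<text:p xmlns:text="%s">'
    'a<text:s/>b<text:s text:c="3"/>c'
    '</text:p>' % TEXT_NS
)
paragraph = ET.fromstring(sample)
print(repr(flatten(paragraph)))
```

Note that the text after a `<text:s/>` element is stored in that element's `tail`, not in `text`, which is one way an element can end up with no text attribute while iterating.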
 
Anton Vredegoor

Martin said:
So if that is the case: What is the problem then? If you interpret
the document as cp1252, and it contains \x93 and \x94, what is
it that you don't like about that? In yet other words: what actions
are you performing, what are the results you expect to get, and
what are the results that you actually get?

Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.

I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it :)

Anton
 
Martin v. Löwis

Anton said:
Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.

Ah. Then the document is most likely right: \x94 can very well occur
in a UTF-8 file.

Anton said:
I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it :)

Well, if the document is UTF-8, you should decode it as UTF-8, of
course.

Regards,
Martin
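A small Python 3 sketch of why the bytes \x93 and \x94 can show up in a valid UTF-8 file: they occur as continuation bytes of multi-byte characters. The two dash characters are chosen only for illustration; nothing in the thread says which characters the actual document contained.

```python
en_dash = "\u2013"   # EN DASH
em_dash = "\u2014"   # EM DASH

utf8_en = en_dash.encode("utf-8")   # three bytes, the last one is 0x93
utf8_em = em_dash.encode("utf-8")   # three bytes, the last one is 0x94
print(utf8_en, utf8_em)

# Decoding those same bytes as cp1252 does not fail. Each byte is mapped
# to some cp1252 character, and 0x93 / 0x94 become curly double quotes.
misread_en = utf8_en.decode("cp1252")
misread_em = utf8_em.decode("cp1252")
print(ascii(misread_en), ascii(misread_em))

# Decoding as UTF-8 recovers the original characters.
print(ascii(utf8_em.decode("utf-8")))
```

So a guessed cp1252 decoding can appear to work while silently producing the wrong characters, which is why the encoding declared in the XML file is the one to use.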
 
Anton Vredegoor

Martin said:
Well, if the document is UTF-8, you should decode it as UTF-8, of
course.

Thanks. This and:

http://en.wikipedia.org/wiki/UTF-8

solved my problem with understanding the encoding.

Anton

proof that I understand it now (please anyone, prove me wrong if you can):

from zipfile import ZipFile

def by80(seq):
    # Yield successive slices of at most 80 characters.
    for start in range(0, len(seq), 80):
        yield seq[start:start + 80]

def utfCheck(infn):
    # content.xml declares UTF-8, so decode it as UTF-8.
    zin = ZipFile(infn, 'r')
    data = zin.read('content.xml').decode('utf-8')
    zin.close()
    for line in by80(data):
        # Raises UnicodeEncodeError if a character has no cp1252 equivalent.
        print(line.encode('cp1252'))

def test():
    infn = "xxx.sxw"
    utfCheck(infn)

if __name__=='__main__':
    test()
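For anyone trying this on Python 3 without a document at hand, here is a rough self-contained variant of the same check. It is a sketch under assumptions: the archive is built in memory with made-up content, and `content_chunks` is a hypothetical helper name, not part of any library.

```python
import io
import zipfile

def content_chunks(archive, width=80, member="content.xml"):
    """Yield the UTF-8 decoded member in pieces of at most `width` characters."""
    with zipfile.ZipFile(archive) as zin:
        text = zin.read(member).decode("utf-8")
    for start in range(0, len(text), width):
        yield text[start:start + width]

# Build a tiny stand-in archive in memory (hypothetical content).
original = "<p>\u201cin Amsterdam\u201d</p>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zout:
    zout.writestr("content.xml", original.encode("utf-8"))
buffer.seek(0)

chunks = list(content_chunks(buffer, width=10))
for chunk in chunks:
    # errors="replace" substitutes "?" for characters cp1252 cannot represent
    # instead of raising UnicodeEncodeError.
    print(chunk.encode("cp1252", errors="replace"))
```

The curly quotes in the sample come out as the single bytes \x93 and \x94 after encoding to cp1252, which matches the codes discussed earlier in the thread.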
 
