Tep said:
How can I replace the '—' sign in a string, or split at that
character?
Getting unicode error if I try to do it:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1: ordinal not in range(128)
script is # -*- coding: UTF-8 -*-
[snip]
I just tried a bit of your code above in my interpreter here and it
worked fine:
|>>> data = 'foo — bar'
|>>> data.split('—')
|['foo ', ' bar']
|>>> data = u'foo — bar'
|>>> data.split(u'—')
|[u'foo ', u' bar']
Figure out the smallest piece of "html source code" that causes the
problem and include that with your next post.
The problem was that I had converted the "html source code" to a unicode
object and didn't encode it back to utf-8 before using split...
Thanks for the help, and sorry for the not-so-smart question.
Pet
You'd still benefit from posting some code. You shouldn't be converting
back to utf-8 to do a split; you should be calling split with a Unicode
string on the Unicode version of the "html source code". Also make sure your
file is actually saved in the encoding you declare. I print your symbol
encoded in two encodings to illustrate why I suspect a mismatch.
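As a minimal sketch of the "decode once, then split with Unicode" advice: the name `raw` is a hypothetical stand-in for the "html source code", and UTF-8 is only an assumed input encoding. The em dash is written as the escape `u'\u2014'` so the snippet does not depend on how the file itself is saved.

```python
# Hypothetical stand-in for the raw "html source code": bytes as they might
# arrive from a file or socket. UTF-8 is an assumption, not a known fact.
raw = b'foo \xe2\x80\x94 bar'

# Decode once, at the boundary where the bytes enter the program.
text = raw.decode('utf-8')

# From here on, keep both sides of every string operation Unicode.
EM_DASH = u'\u2014'
parts = text.split(EM_DASH)            # split at the em dash
cleaned = text.replace(EM_DASH, u'-')  # or replace it with a plain hyphen
```

Encode back to bytes only when writing the result out, not before calling split or replace.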
Below, assume "data" is your "html source code" as a Unicode string:
# -*- coding: UTF-8 -*-
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')
OUTPUT:
'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Note that using the Unicode string in split() works. Also note the decode
byte in the error message when using a non-Unicode string to split the
Unicode data. In your original error message the decode byte that caused an
error was 0x97, which is 'EM DASH' in Windows-1252 encoding. Make sure to
save your source code in the encoding you declare. If I save the above
script in windows-1252 encoding and change the coding line to windows-1252 I
get the same results, but the decode byte is 0x97.
# coding: windows-1252
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')
OUTPUT:
'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: ordinal not in range(128)
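The byte reported in the error can be matched against each encoding directly. The following is only a sketch: the helper `first_byte_hex` is a hypothetical name introduced for illustration, and the explicit `decode('ascii')` stands in for the implicit ASCII decode that Python 2 performs when a byte string is mixed with a Unicode string.

```python
EM_DASH = u'\u2014'

utf8_bytes = EM_DASH.encode('utf-8')            # three bytes
cp1252_bytes = EM_DASH.encode('windows-1252')   # one byte

def first_byte_hex(encoded):
    # Hypothetical helper. bytearray() yields integers on Python 2 and 3.
    return hex(bytearray(encoded)[0])

print(first_byte_hex(utf8_bytes))    # 0xe2
print(first_byte_hex(cp1252_bytes))  # 0x97

# Reproduce the failing step explicitly: decoding the bytes as ASCII.
try:
    cp1252_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    failing = hex(bytearray(exc.object)[exc.start])
    print(failing)                   # 0x97
```

If the byte in your own traceback is 0x97 rather than 0xe2, the file was most likely saved as windows-1252 and not as the declared UTF-8.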
-Mark