Saul Spatz
Hi,
I'm just starting to learn a bit about Unicode. I want to be able to read a UTF-8 encoded file and print out the code points it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?
def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            answer.append(ord(c))
        else:
            skip = True
    return answer

if __name__ == '__main__':
    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])
Thanks for any help you can give me.
Saul
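
One possible simplification, offered as a sketch rather than a definitive answer: it assumes Python 3.3 or later, where a str is already a sequence of code points (PEP 393), so well-formed text never contains surrogate pairs that need joining by hand. The names code_points, format_code_point, and dump_file are made up for this illustration, and the uppercase, zero-padded U+XXXX output is a formatting choice that differs slightly from the hex(c)[2:] output of the script above.

```python
def code_points(text):
    """Return the Unicode code points of text as a list of ints.

    Assumes Python 3.3+, where each element of a str is one code point,
    so no surrogate-pair handling is needed.
    """
    return [ord(ch) for ch in text]


def format_code_point(cp):
    """Format an int code point as U+XXXX (at least four hex digits)."""
    return 'U+{:04X}'.format(cp)


def dump_file(path):
    """Print one code point per line for a UTF-8 encoded file.

    errors='replace' mirrors the script above: undecodable bytes
    become U+FFFD instead of raising UnicodeDecodeError.
    """
    with open(path, encoding='utf-8', errors='replace') as f:
        for cp in code_points(f.read()):
            print(format_code_point(cp))
```

Calling dump_file('test.txt') would then replace the __main__ section of the script above. On a narrow build older than 3.3, characters outside the Basic Multilingual Plane are stored as two surrogates, so this sketch would report them as two separate values there.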