Anthony Liu
Andrew gave me some sample code which lets me read a text
file sentence by sentence.
Suppose I just wanna read the part between 2 full
stops each time.
It works nicely with English text files, where the
full stop is a dot (.).
But when I tried to read Chinese text files, I found
that it sometimes reads a few sentences at one time.
I guess the reason is that in Chinese, the full stop
is not a dot (.), but a little circle, as many of you
probably know.
Indeed, if I replace the Chinese full stop with the
dot, it nicely gets only one sentence each time.
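One way to check this guess is to look at how the Chinese full stop is stored on disk. The sketch below assumes Python 3 and that the file is encoded as UTF-8 or GBK; neither is stated in this post, so treat both as assumptions. It shows that the little circle takes more than one byte, and that reading in fixed 2-byte chunks (as the f.read(2) hint in the sample code suggests) can split it whenever a one-byte ASCII character comes first.

```python
# Assumption: Python 3; the encodings below are guesses, not known facts
# about the file in question.
CHINESE_FULL_STOP = '\u3002'  # the little circle

# How many bytes does the full stop occupy in each encoding?
byte_lengths = {
    encoding: len(CHINESE_FULL_STOP.encode(encoding))
    for encoding in ('utf-8', 'gbk')
}
print(byte_lengths)

# One ASCII letter followed by the Chinese full stop, encoded as GBK.
data = ('A' + CHINESE_FULL_STOP).encode('gbk')

# Read it in fixed 2-byte chunks, the way f.read(2) would.
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

# The full stop straddles two chunks, so no chunk equals it.
full_stop_found = CHINESE_FULL_STOP.encode('gbk') in chunks
print(chunks, full_stop_found)
```

If that is what happens in the real file, it would explain why only some full stops are missed: the ones that land on a chunk boundary are seen, the others are not.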
So, how should I fix this problem? I am really having a
headache processing Chinese characters with Python.
Here is the sample code that Andrew offered:
def bytes(f):
    # Below: f.read(2) to process Chinese
    for byte in iter(lambda: f.read(1), ''):
        yield byte

def sentences(iterable):
    sentence = ''
    for char in iterable:
        sentence += char
        # The little circle is the Chinese
        # full stop. Some of you might not be able
        # to view it if you don't have
        # East Asian language support.
        if char in ('。','.'):
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence
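A possible direction, offered as a sketch rather than a confirmed fix: let Python decode the file, so that each read returns one whole character instead of one byte. This assumes Python 3 and a known file encoding. The helper name chars, the stops parameter, the file name chinese.txt and the utf-8 encoding are all placeholders of mine, not part of Andrew's code.

```python
import io

def chars(f):
    """Yield one decoded character at a time from a text-mode file."""
    # In text mode, f.read(1) returns one character, however many
    # bytes that character occupies on disk.
    for char in iter(lambda: f.read(1), ''):
        yield char

def sentences(iterable, stops=('。', '.')):
    """Group characters into sentences ending at any of the stops."""
    sentence = ''
    for char in iterable:
        sentence += char
        if char in stops:
            yield sentence.strip()
            sentence = ''
    # Emit whatever is left after the last full stop, if anything.
    sentence = sentence.strip()
    if sentence:
        yield sentence

# Hypothetical usage with a real file (name and encoding are placeholders):
# with open('chinese.txt', encoding='utf-8') as f:
#     for s in sentences(chars(f)):
#         print(s)

# Small in-memory check that mixes both kinds of full stop.
sample = io.StringIO('你好。Hello. 再见。')
result = list(sentences(chars(sample)))
print(result)
```

The sentence-splitting logic is the same as in the sample code; only the reading step changes, from bytes to decoded characters.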