Problem processing Chinese character with Python

Anthony Liu · Mar 6, 2004

Andrew gave me some sample code which lets me read a text
file sentence by sentence.

Suppose I just want to read the part between two full
stops each time.

It works nicely with English text files, where the
full stop is a dot (.).

But when I tried to read Chinese text files, I found
that it sometimes reads a few sentences at one time.

I guess the reason is that in Chinese, the full stop
is not a dot (.), but a little circle, as many of you
probably know.

Indeed, if I replace the Chinese full stop with the
dot, it nicely gets only one sentence each time.

So, how should I fix this problem? I am really having a
headache processing Chinese characters with Python.

Here is the sample code that Andrew offered:

def bytes(f):
    # Below: use f.read(2) to process Chinese
    for byte in iter(lambda: f.read(1), ''):
        yield byte

def sentences(iterable):
    sentence = ''
    for char in iterable:
        sentence += char
        # The little circle is the Chinese
        # full stop. Some of you might not be able
        # to view it if you don't have
        # East Asian language support.
        if char in ('。','.'):
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence
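
A possible explanation, offered as a guess rather than a confirmed diagnosis: the Chinese full stop takes more than one byte in encodings such as GB2312 or UTF-8, so a single byte from f.read(1) can never equal it, and f.read(2) only lines up when no one-byte ASCII characters are mixed into the text. The sketch below is not Andrew's code. It assumes a modern Python 3 and a known file encoding (UTF-8 here, which is an assumption), and reads decoded characters instead of raw bytes. The name chars is made up for illustration.

```python
import io

def chars(f):
    # f must be a text-mode file object, so read(1) returns one whole
    # character (for example the Chinese full stop), not one byte.
    for char in iter(lambda: f.read(1), ''):
        yield char

def sentences(iterable, stops=('。', '.')):
    # Same logic as the sample above; only the input is different.
    sentence = ''
    for char in iterable:
        sentence += char
        if char in stops:
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence

# Stand-in for a real file: UTF-8 bytes wrapped in a decoding reader.
# With a real file this would be open(path, encoding='utf-8'), where
# the encoding must match how the file was actually saved.
data = '你好。I am fine. 再见。'.encode('utf-8')
f = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
result = list(sentences(chars(f)))
print(result)
```

If the file is in another encoding such as GB2312 or Big5, the same approach should apply with that encoding name passed in place of 'utf-8'.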


