Text parser (text into sentences) that works with UTF-8 and multiple languages?


mike b.

Hi all,

I have to parse about 2000 files that are written in multiple
languages (some English, some Korean, some Arabic and some Japanese).
I have to split these UTF-8 encoded files into individual sentences.
Has anyone written a good parser that can parse all these non-Latin
character languages, or can someone give me some advice on how to go
about writing a parser that can handle all these fairly different
languages?

Thank you,

Mike
 

Robert Klemme

On 2007/7/30, mike b. wrote:
I have to parse about 2000 files that are written in multiple
languages (some English, some Korean, some Arabic and some Japanese).
I have to split these UTF-8 encoded files into individual sentences.
Has anyone written a good parser that can parse all these non-Latin
character languages, or can someone give me some advice on how to go
about writing a parser that can handle all these fairly different
languages?

I would consider doing this in Java, as Java's regular expressions
support Unicode. That might make the job much easier. OTOH, if all
the files use only the dot, question mark, etc. (i.e. ASCII
characters) as sentence delimiters, then Ruby's regular expressions
may well do the job.

Kind regards

robert
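
A minimal sketch of the ASCII-delimiter approach Robert describes,
assuming Ruby 1.8 and that sentences end only in '.', '!' or '?'
("sample.txt" is a hypothetical file name):

$KCODE = 'u'  # tell Ruby 1.8 to treat strings as UTF-8 (not needed in 1.9+)

text = File.read("sample.txt")  # hypothetical input file

# Grab runs of non-terminator characters followed by one or more terminators.
# Because the terminators are plain ASCII, this is safe on UTF-8 data even
# without the /u flag: UTF-8 continuation bytes never collide with ASCII.
sentences = text.scan(/[^.!?]+[.!?]+/).map { |s| s.strip }

sentences.each { |s| puts s }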
 

Oblomov

I would consider doing this in Java, as Java's regular expressions
support Unicode. That might make the job much easier. OTOH, if all
the files use only the dot, question mark, etc. (i.e. ASCII
characters) as sentence delimiters, then Ruby's regular expressions
may well do the job.

Ruby supports UTF-8 regular expressions: for example, /\w+|\W/u can be
used to scan a string, splitting it into words and non-words. There
were some bugs with Unicode character classification in older versions
of Ruby, but I'm not aware of any in 1.8.6; OTOH I've never tried it
with non-Latin text, so I don't know whether it works correctly in
those cases too.
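
For what it's worth, a quick way to try the pattern Oblomov mentions
on Ruby 1.8 (the sample string is only illustrative, and as he notes,
behaviour on non-Latin text should be verified before relying on it):

$KCODE = 'u'  # UTF-8 string handling for Ruby 1.8

line = "Hello, world. This is a test."

# Scan into word runs and single non-word characters, as described above.
tokens = line.scan(/\w+|\W/u)
p tokens  # => ["Hello", ",", " ", "world", ".", " ", "This", ...]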
 

James Edward Gray II

I have to parse about 2000 files that are written in multiple
languages (some English, some Korean, some Arabic and some Japanese).
I have to split these UTF-8 encoded files into individual sentences.

As has been stated, Ruby's regular expression engine has a Unicode
mode, and that may be all you need here, depending on how you
recognize sentence boundaries.
Has anyone written a good parser that can parse all these non-Latin
character languages, or can someone give me some advice on how to go
about writing a parser that can handle all these fairly different
languages?

I've released an initial version of my Ghost Wheel parser generator
library. It doesn't have documentation yet, but it was built using
TDD, and you should be able to look over the tests to see how it
works. I'm also happy to answer questions.

My hope is that it works fine for non-Latin languages, but I'll
confess that I haven't tested it that way yet. I would try to fix
any issues you uncover, though.

James Edward Gray II
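
Building on the sentence-boundary point above, here is a hedged sketch
(plain regular expressions, not Ghost Wheel) that also treats the
fullwidth CJK terminators and the Arabic question mark as sentence
ends. It is only a starting point: abbreviations, quotation marks and
numbers need more care, and the helper name split_sentences is made up
for this example.

$KCODE = 'u'  # UTF-8 string handling for Ruby 1.8

# Sentence terminators: ASCII . ! ?, fullwidth CJK 。！？ and Arabic ؟
def split_sentences(text)
  # Non-greedily take text up to the next terminator, then swallow any
  # trailing whitespace. /m lets '.' match newlines; /u makes the
  # character class UTF-8-aware. Trailing text without a terminator is
  # silently dropped in this sketch.
  text.scan(/.*?[.!?。！？؟]+\s*/um).map { |s| s.strip }
end

p split_sentences("This is English. これは日本語です。 هل هذا سؤال؟")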
 
