Pattern matching French accented characters

Thomas Luedeke · Mar 1, 2011

I am writing a French conjugation testing script, and a significant
problem I have run into is how to pattern match the accented characters
used in the French language. For example, é, à, è, î,
ï, etc.

I've tried a number of approaches, but can't seem to make it work.
After some research on the Internet, it may require a UTF-8 approach,
but I am not familiar with it.

As an example, assume I want to directly pattern match the French verb
haïr, and distinguish it from other verbs ending in -ir. How would I do
this?

Thanks in advance.

TPL

--
Posted via http://www.ruby-forum.com/.

7stud -- · Mar 1, 2011

If you are not familiar with unicode, and you want to match utf-8
characters, then you had better start reading some unicode tutorials. If
you are already familiar with unicode in general, then in ruby you can
set the $KCODE variable to 'U' for UTF-8, and then you can require the
jcode standard library, which will change the way regexes work--they
will match characters rather than single bytes.

See here:

http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library

7stud -- · Mar 1, 2011

Uhhmm...you don't need to require 'jcode' to make regexes match
characters rather than bytes--just set $KCODE = 'U' (or 'UTF-8'). The
jcode library just gives you some methods like jsize to get the
character length rather than the byte length, which is what String#size
returns.
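
To make the byte length versus character length point concrete, here is a small sketch. It assumes Ruby 1.9 or later, which is not what the post above describes: there, strings carry their own encoding, $KCODE no longer has any effect, String#size counts characters, and String#bytesize counts bytes. On Ruby 1.8 the character count would come from jcode's jsize instead.

```ruby
# Assumes Ruby 1.9 or later. "\u00EF" is the escape for the letter i with
# diaeresis; typing the letter directly in a UTF-8 source file is equivalent.
verb = "ha\u00EFr"

char_count = verb.size      # counts characters
byte_count = verb.bytesize  # counts bytes; the accented letter takes two in UTF-8

puts "#{verb}: #{char_count} characters, #{byte_count} bytes"
```

The one-byte difference is the accented letter, which UTF-8 stores as two bytes.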

As an alternative, you can set the /u flag for a regex to make it match
characters rather than bytes.
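
As a sketch of how the /u suggestion could apply to the original question (haïr versus other -ir verbs), something like the following may work. It assumes Ruby 1.9 or later; the helper name trema_ir? and the verb list are made up for illustration.

```ruby
# Assumes Ruby 1.9 or later. "\u00EF" stands for the letter i with diaeresis;
# a literal accented letter in a UTF-8 source file behaves the same way.
verbs = ["ha\u00EFr", "finir", "choisir", "partir"]

# Hypothetical helper: true only when the verb ends in i-diaeresis + "r".
def trema_ir?(verb)
  # The /u flag marks the regex as UTF-8, so the accented letter is matched
  # as one character rather than as two separate bytes.
  !!(verb =~ /\u00EFr\z/u)
end

verbs.each { |verb| puts "#{verb}: #{trema_ir?(verb)}" }
```

Only the first verb should report true; the plain -ir verbs do not match because "i" and the accented letter are different unicode integers.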

7stud -- · Mar 2, 2011

Here is a short one, 'unicode in three rules':

1) Unicode assigns an integer (officially called a "code point") to
every letter in every alphabet in the world. Currently, there are
something like 100,000 letters.
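
Rule 1 can be poked at directly from ruby. This is only a sketch, assuming Ruby 1.9 or later, using String#ord, String#codepoints and Array#pack with the "U" directive:

```ruby
# Assumes Ruby 1.9 or later.
e_acute = "\u00E9"                  # e with acute accent
puts e_acute.ord                    # the unicode integer for that letter

word = "ha\u00EFr"                  # the verb from the original question
integers = word.codepoints.to_a     # one unicode integer per character
puts integers.inspect

# Going the other way: build a character from its unicode integer.
i_trema = [0xEF].pack("U")
puts i_trema == "\u00EF"
```

The integers printed for the verb are 104, 97, 239 and 114: three plain letters plus 239 (0xEF) for the accented one.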

2) Now the question becomes: what is the best way to store those unicode
integers (which represent characters) on a computer? The way in which
you decide to store a unicode integer on a computer is called an
"encoding".

For instance, you could use 4 bytes to store each unicode integer. In
that system, a series of unicode integers is very easy for ruby to
parse: every 4 bytes represents one unicode integer (which in turn
represents one character). If ruby blindly reads 4 byte chunks, then
each 4 byte chunk will be one unicode integer.

But you don't need 4 bytes to store, say, the unicode integer 60 because
three of those bytes would be empty. In fact, for all unicode integers
under 256 (which cover the Western European alphabets, including the
French accented letters mentioned above), three out of the four bytes
would always be empty. Enter the UTF-8 encoding.
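
The 4-bytes-per-integer scheme described in rule 2 exists as the UTF-32 encoding, so the wasted bytes can be seen directly. A rough sketch, assuming Ruby 1.9 or later and big-endian byte order:

```ruby
# Assumes Ruby 1.9 or later.
word  = "ha\u00EFr"                # the verb from the original question
utf32 = word.encode("UTF-32BE")    # fixed width: 4 bytes per unicode integer

chunks = utf32.bytes.each_slice(4).to_a
chunks.each do |chunk|
  # Rebuild the unicode integer from its four big-endian bytes.
  integer = chunk.inject(0) { |acc, byte| (acc << 8) | byte }
  puts "#{chunk.inspect} -> #{integer}"
end

puts "UTF-32: #{utf32.bytesize} bytes, UTF-8: #{word.bytesize} bytes"
```

Every chunk starts with three zero bytes, which is the waste that UTF-8 avoids.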

3) The UTF-8 encoding uses a variable number of bytes to store unicode
integers on your computer. For unicode integers under 128 (plain ASCII),
UTF-8 stores them in 1 byte, and for larger unicode integers, UTF-8
stores them in 2, 3, or 4 bytes. But then how does ruby know how many
bytes it should read for each unicode integer? Well, UTF-8 has a tricky
way of signaling that to ruby: the leading bits of the first byte say
how many bytes belong to that unicode integer.
As long as you tell ruby that it is reading unicode integers stored in
the UTF-8 format, then ruby will be able to sort out where one
unicode integer ends and the next one begins--even though each unicode
in
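
The signaling described in rule 3 can be sketched by hand: the first byte of each UTF-8 sequence tells the reader how many bytes to consume. The helper names below are made up for illustration, the code assumes Ruby 1.9 or later, and it does no validation of the continuation bytes, so it is not a substitute for ruby's own decoding.

```ruby
# Hypothetical helper: length of a UTF-8 sequence, read from its first byte.
def utf8_sequence_length(lead_byte)
  if lead_byte < 0x80
    1                                   # 0xxxxxxx: plain ASCII
  elsif (lead_byte & 0xE0) == 0xC0
    2                                   # 110xxxxx
  elsif (lead_byte & 0xF0) == 0xE0
    3                                   # 1110xxxx
  elsif (lead_byte & 0xF8) == 0xF0
    4                                   # 11110xxx
  else
    raise ArgumentError, "not a valid first byte: #{lead_byte}"
  end
end

# Hypothetical helper: cut an array of UTF-8 bytes into characters.
def split_utf8(bytes)
  chars = []
  i = 0
  while i < bytes.size
    n = utf8_sequence_length(bytes[i])
    chars << bytes[i, n].pack("C*").force_encoding("UTF-8")
    i += n
  end
  chars
end

bytes = "ha\u00EFr".bytes.to_a   # 104, 97, 195, 175, 114
puts split_utf8(bytes).inspect
```

The bytes 195 and 175 are 0xC3 and 0xAF, the two-byte UTF-8 form of the accented letter in haïr, and the helper groups them back into one character.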

Thomas Luedeke · Mar 2, 2011

Thanks, guys. I'll take a shot in that direction.
