Pattern matching French accented characters

Thomas Luedeke

I am writing a French conjugation testing script, and a significant
problem I have run into is how to pattern match the accented characters
used in the French language. For example, é, à, è, î, ï, etc.

I've tried a number of approaches, but can't seem to make it work.
After some research on the Internet, it may require a UTF-8 approach,
but I am not familiar with it.

As an example, assume I want to directly pattern match the French verb
haïr, and distinguish it from other verbs ending in -ir. How would I do
this?

Thanks in advance.

TPL

--
Posted via http://www.ruby-forum.com/.
 
7stud --

If you are not familiar with unicode, and you want to match utf-8
characters, then you had better start by reading some unicode tutorials.
If you are already familiar with unicode in general, then in ruby 1.8
you can set the $KCODE variable to 'U' for UTF-8, and then you can
require the jcode standard library, which will change the way regexes
work--they will match characters rather than single bytes.

See here:

http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library
 
7stud --

7stud -- wrote in post #984785:
If you are not familiar with unicode, and you want to match utf-8
characters, then you better start reading some unicode tutorials. If
you are already familiar with unicode in general, then in ruby you can
set the $KCODE variable to 'U' for UTF-8, and then you can require the
jcode standard library, which will change the way regexes work--they
will match characters rather than single bytes.

Uhhmm...you don't need to require 'jcode' to make regexes match
characters rather than bytes--just set $KCODE = 'U' (or 'UTF-8'). The
jcode library just gives you some methods like jsize, which returns the
length in characters, whereas String#size returns the length in bytes.
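
For anyone reading this on ruby 1.9 or later, where $KCODE no longer has any effect and the jcode library is gone, the same character-versus-byte distinction can be sketched with String#size and String#bytesize. This is only a sketch under that version assumption; the variable names are made up for illustration.

```ruby
# Sketch for ruby 1.9+ (an assumption; the posts in this thread target ruby 1.8).
word = "ha\u00EFr"            # "haïr", with ï written as the escape \u00EF

char_count = word.size        # counts characters: h, a, ï, r
byte_count = word.bytesize    # counts bytes: ï takes two bytes in UTF-8
```

On those versions String#size already counts characters, so nothing like jsize is needed.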

As an alternative, you can set the /u flag for a regex to make it match
characters rather than bytes.
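
Applied to the original question, a pattern for haïr might look like the sketch below. It assumes ruby 1.9 or later with a UTF-8 source file, where regexes match characters without any extra setup; on ruby 1.8 the same patterns would need $KCODE = 'U' or the /u flag as described above. The verb lists and variable names are illustrative only.

```ruby
# Sketch for ruby 1.9+ with a UTF-8 source file (an assumption).
hair_like_pattern = /ïr\z/    # ends in i-with-diaeresis followed by r
plain_ir_pattern  = /ir\z/    # ends in unaccented -ir

verbs = ["haïr", "finir", "choisir", "ouïr"]

hair_like = verbs.select { |v| v =~ hair_like_pattern }
plain_ir  = verbs.select { |v| v =~ plain_ir_pattern }

# A character class covers several accented letters at once.
accented_pattern = /[éàèîï]/
accented = ["créer", "finir", "haïr"].select { |v| v =~ accented_pattern }
```

Here hair_like ends up holding haïr and ouïr, plain_ir holds finir and choisir, and accented holds créer and haïr. The character class is where byte-oriented matching goes wrong most easily, because each accented letter is two bytes in UTF-8.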
 
7stud --

7stud -- wrote in post #984789:
7stud -- wrote in post #984785:

Here is a short one, 'unicode in three rules':

1) Unicode assigns an integer (called a "code point") to every letter in
every alphabet in the world. There are well over 100,000 of them.

2) Now the question becomes: what is the best way to store those unicode
integers (which represent characters) on a computer? The way in which
you decide to store a unicode integer on a computer is called an
"encoding".

For instance, you could use 4 bytes to store each unicode integer. In
that system, a series of unicode integers is very easy for ruby to
parse: every 4 bytes represents one unicode integer (which in turn
represents one character). If ruby blindly reads 4 byte chunks, then
each 4 byte chunk will be one unicode integer.

But you don't need 4 bytes to store, say, the unicode integer 60 because
three of those bytes would be empty. In fact, for all unicode integers
under 256 (which cover plain English letters plus most accented Western
European letters, such as é and ï), three out of the four bytes would
always be empty. Enter the UTF-8 encoding.

3) The UTF-8 encoding uses a variable number of bytes to store unicode
integers on your computer. For unicode integers under 128 (plain ASCII),
UTF-8 uses 1 byte, and for larger unicode integers it uses 2, 3, or 4
bytes; accented French letters such as é and ï take 2. But then how does
ruby know how many bytes it should read for each unicode integer? Well,
UTF-8 has a tricky way of signaling this to ruby: the first byte of each
unicode integer says how many bytes belong to it. As long as you tell
ruby that it is reading unicode integers stored in the UTF-8 format,
then ruby will be able to sort out where one
unicode integer ends and the next one begins--even though each unicode
in
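
The three rules can be sketched in ruby as follows. This is a minimal sketch assuming ruby 1.9 or later, where each string carries its own encoding; the variable names are illustrative and do not come from the posts above.

```ruby
# Sketch for ruby 1.9+ (an assumption; the thread itself targets ruby 1.8).
word = "ha\u00EFr"   # "haïr"; \u00EF is the unicode integer for ï

# Rule 1: each character maps to one unicode integer.
code_points = word.codepoints.to_a

# Rule 2: a fixed-width encoding spends 4 bytes on every character.
utf32_bytes = word.encode("UTF-32BE").bytesize

# Rule 3: UTF-8 uses 1 byte each for h, a, r and 2 bytes for ï.
utf8_sizes = word.chars.map { |c| c.bytesize }

# The first byte of a UTF-8 sequence announces its length:
# 0xxxxxxx means 1 byte, 110xxxxx means 2 bytes,
# and every following byte looks like 10xxxxxx.
i_diaeresis_bits = "\u00EF".bytes.map { |b| b.to_s(2).rjust(8, "0") }
```

For this word, code_points is [104, 97, 239, 114], the 4-byte encoding needs 16 bytes, and UTF-8 needs only 5. The two bytes of ï are 11000011 and 10101111, which are the C3 and AF that show up when accented text is displayed with the wrong encoding.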
 
