UTF-8

Mark J. Reed

Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement. But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now? Any other UTF-8 tidbits
in there I should know about?

Thanks!
 
Paul Battley

Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement. But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now?

Regular expressions are UTF-8-aware if $KCODE is set to 'u' or if the
regular expression carries the u flag (e.g. /./u). This has been the
case since at least 1.8.2 (I don't have any other versions to hand to
check right at this moment, but I'm pretty confident that 1.8.1,
1.8.3, and 1.8.4 operate similarly).

Any other UTF-8 tidbits in there I should know about?

In regular expressions? You should be aware that /./u matches a UTF-8
codepoint, but ranges only work on byte values (e.g. /[\x00-\xff]/).
As UTF-8 sequences are distinct (that is, the byte sequence for one
character never occurs inside the sequence for a different character),
matching is not generally a problem. When replacing, you have to make sure that you
aren't replacing a part of a byte sequence, or you'll end up with
illegal sequences.
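To make the character/byte distinction concrete, here is a minimal sketch. It assumes a Ruby where the u flag behaves as described above (1.8 or any later version); the sample string and the variable names are illustrative only.

```ruby
s = "caf\xC3\xA9"            # "café": the é is the two bytes 0xC3 0xA9

chars = s.scan(/./u)         # one element per UTF-8 character
bytes = s.unpack('C*')       # one element per byte

puts chars.length            # 4
puts bytes.length            # 5

# Chopping off the last byte splits the é and leaves an illegal sequence;
# unpack('U*') refuses to decode it and raises ArgumentError.
broken = bytes[0..-2].pack('C*')
begin
  broken.unpack('U*')
  puts "decoded fine"
rescue ArgumentError => e
  puts "malformed UTF-8: #{e.message}"
end
```

The same kind of damage can happen with a substitution whose pattern works on bytes rather than characters, which is why the replacement case deserves more care than plain matching.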

Here's a UTF-8 regular expression trick to truncate a string safely:
string[/.{0,#{max_length}}/u]
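Wrapped up as a method, the trick might look like the sketch below. The name truncate_utf8 is made up for illustration, and the m flag is an addition to the one-liner above: without it, . does not match a newline, so the result would stop at the first line break.

```ruby
# Hypothetical helper: returns at most max_length characters (not bytes).
def truncate_utf8(string, max_length)
  # Integer() guards the interpolation; /m lets . match "\n" as well.
  string[/.{0,#{Integer(max_length)}}/mu]
end

s = "na\xC3\xAFve caf\xC3\xA9"   # "naïve café"

short = truncate_utf8(s, 3)      # "naï": 3 characters, 4 bytes
puts short.unpack('C*').length   # 4
```

Because the pattern counts whole characters, the cut can never land in the middle of a multibyte sequence.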

There are plenty of other UTF-8 tricks to be done using pack/unpack
with 'U*', as well...
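As a rough illustration of the pack/unpack approach (the variable names are mine, and this is only one of many possible uses): unpack('U*') turns a UTF-8 string into an array of code points, and pack('U*') turns such an array back into a UTF-8 string.

```ruby
s = "caf\xC3\xA9"                          # "café"

codepoints = s.unpack('U*')                # [99, 97, 102, 233]
puts codepoints.length                     # 4, the character count

rebuilt  = codepoints.pack('U*')           # same bytes as s
reversed = codepoints.reverse.pack('U*')   # "éfac", reversed by character
```

Working on the code point array means operations such as counting, slicing, and reversing act on characters, so they cannot split a multibyte sequence.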

Paul.
 
