UTF-8

Mark J. Reed · Jan 6, 2006

Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement. But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now? Any other UTF-8 tidbits
in there I should know about?

Thanks!

Paul Battley · Jan 6, 2006

Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement. But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now?

Regular expressions are UTF-8-aware if $KCODE is set to 'u' or if there
is a u specifier after the regular expression (e.g. /./u). This has been
the case since at least 1.8.2 (I don't have any other versions to hand to
check right now, but I'm pretty confident that 1.8.1, 1.8.3, and 1.8.4
behave similarly).
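
As a rough illustration of the u specifier, here is a minimal sketch. The sample word and variable names are made up for the example, and it leaves $KCODE alone, since later Ruby versions handle encodings differently.

```ruby
# "naïve" written with explicit byte escapes: the ï is two bytes (0xC3 0xAF).
word = "na\xC3\xAFve"

# With the u specifier, . consumes one whole UTF-8 character per match.
chars = word.scan(/./u)

# Counting bytes via unpack('C*') avoids relying on String#length, whose
# meaning (bytes vs. characters) depends on the Ruby version.
byte_count = word.unpack('C*').length

puts chars.length   # 5 characters
puts byte_count     # 6 bytes
```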

Any other UTF-8 tidbits in there I should know about?

In regular expressions? You should be aware that /./u matches a whole
UTF-8 character (code point), but ranges only work on byte values
(e.g. /[\x00-\xff]/). Because UTF-8 sequences are distinct (that is, the
byte sequence for one character never appears inside a longer sequence
with a different meaning), matching is not generally a problem. When
replacing, you have to make sure that you aren't replacing part of a
multibyte sequence, or you'll end up with invalid sequences.
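
To make the replacement pitfall concrete, here is a small sketch. The helper valid_utf8? is hypothetical (it is not part of Ruby); it only relies on unpack('U*') raising ArgumentError for malformed input, so treat it as a rough check rather than a full validator.

```ruby
# Hypothetical helper: unpack('U*') raises ArgumentError on malformed UTF-8,
# so a successful decode is treated as "valid".
def valid_utf8?(string)
  string.unpack('U*')
  true
rescue ArgumentError
  false
end

cafe = "caf\xC3\xA9"   # "café": the é is two bytes (0xC3 0xA9)

# Cutting at a byte boundary keeps 0xC3 but drops 0xA9, splitting the é.
byte_cut = cafe.unpack('C*')[0, 4].pack('C*')

# Removing the last character with a /u regex takes the é out as a whole.
char_cut = cafe.sub(/.\z/u, '')

puts valid_utf8?(cafe)      # true
puts valid_utf8?(byte_cut)  # false
puts valid_utf8?(char_cut)  # true
```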

Here's a UTF-8 regular expression trick to truncate a string safely:
string[/.{0,#{max_length}}/u]
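
Wrapped in a method, the trick might look like the sketch below. The name truncate_utf8 and the Integer() guard are additions for illustration, not part of the original one-liner.

```ruby
# Hypothetical wrapper around the one-liner above: returns at most
# max_length UTF-8 characters from the start of string.
def truncate_utf8(string, max_length)
  # Integer() rejects anything that is not a whole number before it is
  # interpolated into the pattern.
  string[/.{0,#{Integer(max_length)}}/u]
end

text = "h\xC3\xA9llo w\xC3\xB6rld"   # "héllo wörld"

short = truncate_utf8(text, 2)       # "hé": 2 characters, 3 bytes
whole = truncate_utf8(text, 100)     # shorter than the limit, so unchanged

puts short.unpack('C*').length       # 3
puts whole == text                   # true
```

One thing to watch: without the m flag, . does not match a newline, so the result stops at the first line break. Using /mu instead of /u should lift that restriction if multi-line input matters.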

There are plenty of other UTF-8 tricks to be done using pack/unpack
with 'U*', as well...
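
For instance, a couple of things 'U*' makes easy, as a minimal sketch (the sample word and variable names are just for illustration):

```ruby
word = "h\xC3\xA9llo"   # "héllo": 5 characters, 6 bytes

# Decode the UTF-8 bytes into an array of integer code points.
codepoints = word.unpack('U*')

# Count characters rather than bytes.
char_count = codepoints.length

# Reverse by character (not by byte) and re-encode as UTF-8.
reversed = codepoints.reverse.pack('U*')

p codepoints      # [104, 233, 108, 108, 111]
puts char_count   # 5
```

Reversing the raw bytes instead would swap the two bytes of the é and leave an invalid sequence, which is why the round trip through code points is the safer route here.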

Paul.
