[ANN] 1.9 String and M17N documentation


Brian Candler

I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at

http://github.com/candlerb/string19

There is test code interspersed within the comments, so you can run it
to verify the behaviours described.

P.S.: I've spent enough time working on this that I felt entitled to add
another file, soapbox.rb, with my own opinion on all this. Feel free to
ignore it.
 

Gregory Brown

I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at

http://github.com/candlerb/string19

There is test code interspersed within the comments, so you can run it
to verify the behaviours described.

Clever approach and looks to be a great resource. Thanks for writing this up.

-greg
 

James Gray

I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at

http://github.com/candlerb/string19

There is test code interspersed within the comments, so you can run it
to verify the behaviours described.

I just wanted to say that I enjoyed reading through what you have
created. I think you've shown a neat way to document behaviors, with
your comment and code mix. Even your simple alias of assert_equal()
to is() really adds to the overall presentation.
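The style is easy to picture with a tiny sketch (hypothetical code, not the actual string19.rb source), where a small is() helper stands in for the assert_equal alias:

```ruby
# A minimal sketch of the comment-plus-test documentation style:
# prose lives in comments, and a tiny is() helper keeps the
# assertions readable. (Hypothetical helper, not string19.rb itself.)
def is(expected, actual)
  raise "expected #{expected.inspect}, got #{actual.inspect}" unless expected == actual
end

# In a UTF-8 source file, string literals are tagged UTF-8:
is "UTF-8", "hello".encoding.name

# ...and a literal containing only 7-bit bytes reports ascii_only?:
is true, "hello".ascii_only?
```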

I've added a link to this repository in a comment to the first article
of my m17n series to help people find it.

It does run for me on Mac OS X, though I do get a warning:

$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
 

Brian Candler

James said:
I just wanted to say that I enjoyed reading through what you have
created. I think you've shown a neat way to document behaviors, with
your comment and code mix. Even your simple alias of assert_equal()
to is() really adds to the overall presentation.

Thanks James.
It does run for me on Mac OS X, though I do get a warning:

$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
.
Finished in 0.589675 seconds.

Hmm. Could you try replacing 'LANG' with 'LC_ALL' globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.
* I'm not sure this is correct:

# 5. If one object is a String which contains only 7-bit ASCII characters
# (ascii_only?), then the objects are compatible and the result has the
# encoding of the other object.

Thank you, fixed.
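The behaviour the rule describes can be checked directly; Encoding.compatible? reports the encoding a concatenation would get:

```ruby
# Checking the 7-bit ASCII compatibility rule as it behaves in practice.
ascii = "plain".force_encoding("US-ASCII")   # ascii_only? is true
utf8  = "héllo"                              # UTF-8 literal, not ascii_only?

result = utf8 + ascii
p result.encoding                      # => #<Encoding:UTF-8>

# Encoding.compatible? reports the encoding such a concatenation would get:
p Encoding.compatible?(ascii, utf8)    # => #<Encoding:UTF-8>
```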
* I don't believe this is accurate:

# Normally, writing a string to a file ignores the encoding property.

I think we crossed over on that one. I spotted the error after
re-reading your articles and posted a correction - I think it's right
now.
* You say that m17n's complexity can be avoided if we just used UTF-8
everywhere and transcoded incoming and outgoing data. I agree. If we
do that in Ruby 1.9 though, transcode all data as it comes in and just
work with UTF-8 internally, doesn't all the complexity of m17n go
away? Compatible encodings, the comparison order of differing
encodings, and the like will all be non-issues.

Yes, for scripts that process text. And in practice, this is what most
people processing text will find: their source is in their preferred
encoding, their external files are in their preferred encoding, and
everything "just works" - pretty much in the way that ruby 1.8 did with
$KCODE.

I have two key problems.

1. Working with binary. I can force the encoding on my own source files,
and I can force the encoding on any files that I open, but I still have
to interact with other libraries which return strings. If I build a
string by concatenating strings taken from elsewhere, I have to force
the encodings. If I forget, it may work sometimes (if those strings are
7-bit), but will fail if they are 8-bit.

Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.
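Concretely, the intermittent failure described above looks like this (a minimal sketch):

```ruby
# The failure mode: concatenating a UTF-8 string with binary data works
# while the binary happens to be 7-bit, then raises when it is not.
utf8   = "héllo"
ascii7 = "GIF".force_encoding("ASCII-8BIT")        # 7-bit bytes
binary = "\xFF\xD8".force_encoding("ASCII-8BIT")   # high bytes

ok = utf8 + ascii7     # fine: the binary string is ascii_only?
begin
  utf8 + binary        # same code, 8-bit data
  raise "expected Encoding::CompatibilityError"
rescue Encoding::CompatibilityError
  # this is the "works sometimes, fails later" trap described above
end
```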

2. Working with other people's libraries.

Take REXML as an example. Suppose I decide I want to do this:

doc = REXML::Document.new(src)

Under 1.8, I could do this without worrying. But under 1.9, a whole host
of questions tumble out.

- will REXML require me to have set the src to the correct encoding?
- in order to parse it, will it reset the encoding of my 'src' object?
What will it do if 'src' is frozen? Will it dup the string?

XML documents carry their encoding within them. There's the xml charset
declaration, and the BOM, and failing that the document is UTF-8 by
definition, because if it were in a different encoding, then it *must*
declare it:

http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

So I reckon REXML should ignore the encoding of src. Even if it were
tagged as (say) ISO-8859-1 because that's the locale encoding, or
ASCII-8BIT because it came from a socket, it should be treated as UTF-8
unless declared otherwise. And then if I access the node using #text,
would I get something tagged as UTF-8, or something else?

The only way to be sure is to try it and see (and a quick test suggests
that it does work in the way I described).
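Such a quick test might look like the following sketch (behaviour as observed on recent Rubies, matching what the thread reports for 1.9):

```ruby
require 'rexml/document'

# Hand REXML a binary-tagged source with a declared encoding and inspect
# what #text returns. The expectation: REXML honours the XML declaration,
# transcodes, and tags the result UTF-8.
src = "<?xml version='1.0' encoding='ISO-8859-1'?><root>\xFCber</root>".force_encoding("BINARY")
doc = REXML::Document.new(src)
text = doc.root.text

p text            # => "über"
p text.encoding   # => #<Encoding:UTF-8>
```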

But this process has to be repeated for every library you might use.
 

James Gray

Hmm. Could you try replacing 'LANG' with 'LC_ALL' globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.

I bet the issue is this line in my .bashrc:

export LC_CTYPE=en_US.UTF-8
I have two key problems.

1. Working with binary. I can force the encoding on my own source files,
and I can force the encoding on any files that I open, but I still have
to interact with other libraries which return strings. If I build a
string by concatenating strings taken from elsewhere, I have to force
the encodings. If I forget, it may work sometimes (if those strings are
7-bit), but will fail if they are 8-bit.

Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.

I think I understand what you are saying here. You have a good point:
it would be annoying to have the Encoding of the JPEG you are building
up flip from ASCII-8BIT to UTF-8.
2. Working with other people's libraries.

Take REXML as an example. Suppose I decide I want to do this:

doc = REXML::Document.new(src)

Under 1.8, I could do this without worrying.

Really?

What did it do under Ruby 1.8 when fed an XML document that was UTF-16
encoded? Will it read it? When I do searches for content, will it
hand me UTF-16 or UTF-8? These are just some questions that jump to
my mind.

As you've said, about the best I can think of is to test it and find
out, only this is Ruby 1.8 I'm talking about here.

Let's see how it works:

$ ruby -r rexml/document -e 'REXML::Document.new(ARGF.read)' utf16_with_bom.xml
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
#<Iconv::InvalidCharacter: "\340\250\274\347\215\257\346\265\245\347\221\241\346\234\276\345\215\257\346\265\245\342\201\203\346\275\256\347\221\245\346\271\264\343\260\257\347\215\257\346\265\245\347\221\241\346\234\276", ["\n"]> (REXML::ParseException)
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `conv'
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `decode'
/usr/local/lib/ruby/1.8/rexml/source.rb:57:in `encoding='
/usr/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:213:in `pull'
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:22:in `parse'
/usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
/usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
-e:1:in `new'
-e:1
...
"\n"
Line:
Position:
Last 80 unconsumed characters:
<sometag>Some Content</sometag>
from /usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
from /usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
from -e:1:in `new'
from -e:1

Ah, it just tells me my data is invalid. It's not though:

$ iconv -f UTF-16BE -t UTF-8 < utf16_with_bom.xml
<?xml version="1.0" encoding="UTF-16BE"?>
<sometag>Some Content</sometag>

Ruby 1.9 can read it:

$ ruby_dev -r rexml/document -e 'puts REXML::Document.new(ARGF.read.force_encoding("BINARY")).to_s' utf16_with_bom.xml
<?xml version='1.0' encoding='UTF-16BE'?>
<sometag>Some Content</sometag>

It looks like it's supposed to work in Ruby 1.8 too and I've just hit a
bug. At least, if I'm reading the source right. I had to check.

Anyway, the point of all this is that it really isn't any easier, for
me, to reason about Ruby 1.8 encoding behavior. Ruby 1.9 didn't
invent character encodings, it just started paying attention to them
as we all should have been doing all along. That's all my opinion, of
course.

James Edward Gray II
 

lith

Ruby 1.9 didn't invent character encodings

Just out of curiosity. Are there other languages that handle encodings
the way ruby 1.9 does?
 

Eric Hodel

* You say that m17n's complexity can be avoided if we just used
UTF-8 everywhere and transcoded incoming and outgoing data. I
agree. If we do that in Ruby 1.9 though, transcode all data as it
comes in and just work with UTF-8 internally, doesn't all the
complexity of m17n go away? Compatible encodings, the comparison
order of differing encodings, and the like will all be non-issues.
Thus it seems to me that m17n allows us to take this favored
approach or take harder roads, if we so choose.

I'm too lazy to dig this out of the archives, but there are some
encodings that don't have a 1:1 mapping to Unicode thus the round-trip
through UTF-8 (etc.) will destroy them.

In short, Ruby doesn't transcode everything to preserve the integrity
of your data.
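Eric's caveat concerns encodings without a 1:1 mapping to Unicode, which is hard to demonstrate briefly; a closely related and easy-to-reproduce case is a character with no mapping at all in the target encoding, where Ruby refuses to transcode rather than silently corrupt data:

```ruby
# The euro sign exists in UTF-8 and ISO-8859-15 but not in Shift_JIS, so
# the round trip through that encoding cannot even begin.
euro = "\u20AC"
begin
  euro.encode("Shift_JIS")
  raise "expected an error"
rescue Encoding::UndefinedConversionError
  # Ruby refuses rather than silently corrupting the data
end

# An explicit fallback makes the loss visible and deliberate:
p euro.encode("Shift_JIS", undef: :replace)   # => "?"
```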
 

Brian Candler

Eric said:
I'm too lazy to dig this out of the archives, but there are some
encodings that don't have a 1:1 mapping to Unicode thus the round-trip
through UTF-8 (etc.) will destroy them.

Indeed, although we're both having a hard time thinking of an actual
example. It seems that dealing with such things is not an everyday
requirement for most people. So you write a library for that, and then
the rest of us aren't saddled with the complexity.
 

Brian Candler

James said:
Really?

What did it do under Ruby 1.8 when fed an XML document that was UTF-16
encoded? Will it read it? When I do searches for content, will it
hand me UTF-16 or UTF-8? These are just some questions that jump to
my mind.

OK, I didn't write my statement clearly enough.

In ruby 1.8, the question is, "will it parse this document?"

In ruby 1.9, the question is, "will it parse this document, *and* does
the correct parsing depend on which encoding I set the 'src' string to,
and if so, what do I need to set it to?"

Then take a method which returns a string, say REXML::Element.text().
This is a bit simpler.

In ruby 1.8, the question is, "does this return the content of my
element, and has it been transcoded?"

In ruby 1.9, the question is the same, *plus* "what encoding does it set
on that value?"

OK, so it looks like REXML has transcoded to UTF-8, and tagged the
result as such. I'm not really helping my case because you have to do
the same test with 1.8:
require 'rexml/document'
=> true
d = REXML::Document.new("<?xml encoding='iso-8859-1'?><root>\xfcber</root>")
d.elements[1].text
=> "\303\274ber"

So it's been transcoded here too. But I don't have to worry about what
encoding 'tag' it has been given.

Maybe all this would be much simpler if Ruby didn't crash when given
incompatible encodings, but transcoded the right-hand-side
automatically. For example:

a << b
# a keeps its original encoding, b is transcoded to a's encoding

a.tr("ü","Ü")
# the ü and Ü are transcoded to a's encoding first

- with transcoding to BINARY being a null operation.
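The proposed rule can be sketched as a helper method (names hypothetical; this is not a real Ruby API):

```ruby
# A sketch of the auto-transcoding proposal: the left operand keeps its
# encoding, the right is converted to match, and a BINARY receiver takes
# raw bytes unchanged (the "null operation" case).
def append_transcoded(a, b)
  if a.encoding == Encoding::ASCII_8BIT
    a << b.dup.force_encoding(Encoding::ASCII_8BIT)  # treat as raw bytes
  else
    a << b.encode(a.encoding)                        # transcode to match a
  end
end

s = "héllo ".dup                             # UTF-8
append_transcoded(s, "wörld".encode("ISO-8859-1"))
p s             # => "héllo wörld"
p s.encoding    # => #<Encoding:UTF-8>

buf = "GIF".b   # BINARY buffer (String#b is Ruby 2.0+)
append_transcoded(buf, "wörld")
p buf.encoding  # => #<Encoding:ASCII-8BIT>
```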
 

Eric Hodel

Indeed, although we're both having a hard time thinking of an actual
example. It seems that dealing with such things is not an everyday
requirement for most people.

This seems to be similar to the reasoning behind two-digit years.
So you write a library for that, and then the rest of us aren't
saddled with the complexity.

Unfortunately, software ends up getting used in places the author
didn't expect. Why not write robust software the first time instead
of being lazy?
 

Brian Candler

Eric said:
This seems to be similar to the reasoning behind two-digit years.

I don't understand what you're getting at. Obviously the round trip
4-digit-years -> 2-digit-years -> 4-digit-years is not lossless, but
that would be a silly thing to do (i.e. if you've captured
4-digit-years, then you store them and work with them as 4-digit-years).

You're saying you want to avoid external->Unicode->external encoding
transcodings. But these are rarely problematic (I've still not seen an
example), and in those rare cases you could just handle the external
encoding as binary data. Remember also that for stateful encodings,
you're forced to transcode anyway - even ruby 1.9 won't handle snippets
of ISO_2022_JP in isolation, for example.
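The stateful-encoding point is visible in the Encoding API itself: ISO-2022-JP is a "dummy" encoding, which Ruby can transcode whole strings to and from but cannot index or match directly:

```ruby
# Dummy encodings are stateful: supported for whole-string transcoding,
# but not for character-level operations.
p Encoding::ISO_2022_JP.dummy?   # => true
p Encoding::UTF_8.dummy?         # => false

jp = "こんにちは".encode("ISO-2022-JP")   # whole-string transcoding works
p jp.encode("UTF-8") == "こんにちは"      # => true
```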
Unfortunately, software ends up getting used in places the author
didn't expect. Why not write robust software the first time instead
of being lazy?

In My Opinion (which may not be shared by anyone else), ruby 1.9's
String implementation is anything but robust. It's over-complicated,
under-specified, buggy as hell, and badly gets in the way when you want
to work with binary data or write programs which don't crash when given
unexpected input.

If it were optional, it would be fine. Since it's a mandatory part of
the language, it destroys it for me. Ruby 1.8 is a fine general purpose
language; ruby 1.9 is a text-processing language (and may still trip you
up even in that case)

Regards,

Brian.
 

Brian Candler

BTW, I find James's writeup of what he had to do to the CSV library (*)
enlightening. Even ruby 1.9 won't match an ASCII regexp like /,/ against
a wide encoding, so he had to generate new regexps dynamically at
runtime.

Now, I think that's a good thing, optimising the regexps to match the
incoming data stream efficiently. But I also observe that this would
have worked just fine if the encoding were a property of the regexp only
- which is the approach 1.8 takes to regexps. What I mean is, once you
have decided to build a "UTF-16LE" regexp, say, you can just match it
against a stream of bytes.
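The limitation is easy to reproduce; the workaround shown afterwards is a sketch (transcode, operate, transcode back), not the approach the CSV library actually took:

```ruby
# An ASCII-compatible regexp cannot be matched against a "wide"
# (ASCII-incompatible) encoding at all.
utf16 = "a,b,c".encode("UTF-16LE")
begin
  utf16 =~ /,/
  raise "expected an error"
rescue Encoding::CompatibilityError
  # hence the need to build per-encoding parsers at runtime
end

# One workaround (a sketch, not what CSV does): transcode to UTF-8,
# operate there, and transcode the pieces back.
fields = utf16.encode("UTF-8").split(",").map { |f| f.encode("UTF-16LE") }
p fields.size             # => 3
p fields.first.encoding   # => #<Encoding:UTF-16LE>
```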

Making every single String also have an encoding property only gives
more opportunities for Ruby to raise exceptions. Some may argue this is
Ruby "protecting" you from doing something silly, but if I'm working
with string literals or binary data returned from a library, whose
encoding may or may not have been set to ASCII-8BIT, then I don't want
this "protection". Rather, I need protecting against ruby 1.9.

There is only one case I can see where having the encoding be a property
of the String itself is useful: selecting individual characters by
index. e.g.

if str.size > 50
  str = str[0,47] + "..."
end

There's a huge amount of language pain introduced just for that.
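The benefit is that size and [] are character-based, so the truncation above can never split a multibyte character:

```ruby
# Character-based indexing: truncation by character count keeps the
# string valid even when every character is multibyte.
str = "日本語のテキスト" * 10   # 80 characters, 240 bytes in UTF-8
p str.size              # => 80
p str.bytesize          # => 240

str = str[0, 47] + "..." if str.size > 50
p str.size              # => 50
p str.valid_encoding?   # => true (no character was cut in half)
```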

Regards,

Brian.

(*) http://blog.grayproductions.net/articles/what_ruby_19_gives_us
 

Brian Candler

James said:
* Just FYI, you ask the following about Regexp::FIXEDENCODING:

# FIXME: What is the purpose of this flag?

I do try to explain that under Regexp Encodings in this article, if
you are interested:

http://blog.grayproductions.net/articles/miscellaneous_m17n_details

"A fixed_encoding?() Regexp is one that will raise an
Encoding::CompatibilityError if matched against any String that contains
a different Encoding from the Regexp itself."

I think that's not exactly correct:

$ irb19 --simple-prompt
=> 0

AFAICS, it will only raise an error if the matched string is of a
different encoding *and* is not ascii_only?
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with ISO-8859-1 string)
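That correction can be verified directly (a minimal check):

```ruby
# A fixed-encoding Regexp tolerates strings in other encodings as long
# as they are ascii_only?; it raises only when they are not.
r = /é/                 # non-ASCII literal, so the Regexp is fixed to UTF-8
p r.fixed_encoding?     # => true

ascii_latin = "abc".force_encoding("ISO-8859-1")
p r =~ ascii_latin      # => nil (no error: the string is ascii_only?)

latin = "caf\xE9".force_encoding("ISO-8859-1")
begin
  r =~ latin
  raise "expected an error"
rescue Encoding::CompatibilityError
  # raised only for a different encoding AND non-ASCII content
end
```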
 

Gregory Brown

In My Opinion (which may not be shared by anyone else), ruby 1.9's
String implementation is anything but robust. It's over-complicated,
under-specified, buggy as hell, and badly gets in the way when you want
to work with binary data or write programs which don't crash when given
unexpected input.

I'm not sure what binary data you've been having such great problems
with. Prawn deals with a lot of binary data, and yes, we needed to
make sure that it was being loaded as such and not accidentally
treated as encoded bytes, but I really didn't find this to be a major
undertaking. I guess this is because we didn't need to port over
existing 1.8 code and wrote our implementation with 1.9 in mind, but
maybe I'm missing some big problem that we didn't hit in our use case.

On a personal note, I wish you'd cut out the vitriol, because you're
acting like a jerk. You have learned a lot about the M17n system,
produced valuable resources in the process, and helped uncover dark
corners and bugs, and for that the community can be grateful. But if
you manage to make everyone feel miserable in the process with your
abrasive attitude, I don't think that's going to do anything for anyone.

You've made your feelings about the design decisions very clear. Now
can you maybe stick to the technical details so that these discussions
don't become nasty unnecessarily?

-greg
 

James Gray

"A fixed_encoding?() Regexp is one that will raise an
Encoding::CompatibilityError if matched against any String that contains
a different Encoding from the Regexp itself."

I think that's not exactly correct:

$ irb19 --simple-prompt
=> 0

AFAICS, it will only raise an error if the matched string is of a
different encoding *and* is not ascii_only?

Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with ISO-8859-1 string)

Thanks for the correction. I've updated the article you quoted with a
correction.

James Edward Gray II
 

Eric Hodel

I don't understand what you're getting at.

"dealing with [non 1:1 conversion round trips] is not an everyday
requirement for most people" is roughly equivalent to "four-digit
years is not an everyday requirement for most people" (or was, back
when people were using two-digit years)
You're saying you want to avoid external->Unicode->external encoding
transcodings.

I was stating that this is a design goal of ruby's encoding features.
(And likely causes much of the pain you feel in this area.)
But these are rarely problematic (I've still not seen an
example), and in those rare cases you could just handle the external
encoding as binary data.

Agreed. Furthermore, most of the time software is likely to only work
within a single encoding.
Remember also that for stateful encodings, you're forced to
transcode anyway - even ruby 1.9 won't handle snippets of
ISO_2022_JP in isolation, for example.


Software written without this in mind will probably be used this way
regardless of the original authors' intent (and will break), much like
two-digit-year software did when four-digit years became necessary.

PS: I think you can provide valuable input on how to make ruby's API
for encodings more robust and easier to use, but you seem to hate it
so much that you can't be bothered to raise issues in a way that will
get them fixed.
 

Brian Candler

Eric said:
PS: I think you can provide valuable input on how to make ruby's API
for encodings more robust and easier to use, but you seem to hate it
so much that you can't be bothered to raise issues in a way that will
get them fixed.

It's not so much "can't be bothered", as "don't believe that a U-turn is
going to happen".

Maybe some bandaids would be accepted (e.g. ASCII-8BIT is compatible
with everything and forces the result to ASCII-8BIT), but I'm hesitant
to propose enlarging the ruleset further.
 

Gregory Brown

Maybe some bandaids would be accepted (e.g. ASCII-8BIT is compatible
with everything and forces the result to ASCII-8BIT), but I'm hesitant
to propose enlarging the ruleset further.

I think this is a good change that would at least cause mistakes to
fail faster.

I also suggested a simple binary string syntax on ruby-core, allowing:

%b{GIF} to be shorthand for "GIF".force_encoding("BINARY")

(Though that's admittedly more cosmetic than functionally significant)
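For reference, here is what the proposed %b{...} literal would abbreviate (String#b, an equivalent shorthand, arrived later in Ruby 2.0):

```ruby
# The existing spellings of a binary string literal; BINARY is an alias
# for ASCII-8BIT.
sig = "GIF".force_encoding("BINARY")
p sig.encoding    # => #<Encoding:ASCII-8BIT>
p "GIF".b == sig  # => true
```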

A U-Turn is very unlikely to happen, but I imagine Matz will be
receptive for polishing things around the edges.
 

Brian Candler

Gregory said:
A U-Turn is very unlikely to happen, but I imagine Matz will be
receptive for polishing things around the edges.

I have put a few ideas in a document 'alternatives.markdown' at the same
location.

The other possibility which may make sense is to transcode
automatically. For example, in

s3 = s1 + s2

then s2 is transcoded to s1's encoding, and the result s3 always has
s1's encoding.

That could actually be useful in helping to combine strings from
different sources. All the compatibility rules would vanish, and rather
than raising exceptions, ruby would just "do the right thing".
Transcoding to BINARY/ASCII-8BIT would be a null operation, so building
a binary string would be safe too.

This isn't a total U-turn, but it would be quite a major shift and I
suspect too big for 1.9.x.
 

Gregory Brown

That could actually be useful in helping to combine strings from
different sources. All the compatibility rules would vanish, and rather
than raising exceptions, ruby would just "do the right thing".
Transcoding to BINARY/ASCII-8BIT would be a null operation, so building
a binary string would be safe too.

This isn't a total U-turn, but it would be quite a major shift and I
suspect too big for 1.9.x.

Yeah, this is also a reasonable behavior, IMO. However, I think Matz
has some reservations about (potentially lossy) transcoding, which is
the reason for the M17N system in the first place. Special-casing
ASCII-8BIT might be more conservative.

-greg
 
