James said:
I just wanted to say that I enjoyed reading through what you have
created. I think you've shown a neat way to document behaviors, with
your comment and code mix. Even your simple alias of assert_equal()
to is() really adds to the overall presentation.
Thanks James.
It does run for me on Mac OS X, though I do get a warning:
$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
.
Finished in 0.589675 seconds.
Hmm. Could you try replacing 'LANG' with 'LC_ALL' globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.
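One way to probe this from Ruby itself (a hypothetical test, not part of the suite) is to spawn a child interpreter with a controlled environment and see which variable actually drives Encoding.locale_charmap:

```ruby
require 'rbconfig'

# Probe which variable drives Encoding.locale_charmap on this platform.
# Per setlocale(3): LC_ALL overrides LC_CTYPE, which overrides LANG.
def charmap_under(env)
  blank = { 'LC_ALL' => nil, 'LC_CTYPE' => nil, 'LANG' => nil }
  IO.popen([blank.merge(env), RbConfig.ruby,
            '-e', 'print Encoding.locale_charmap']) { |io| io.read }
end

puts charmap_under('LANG' => 'C')
puts charmap_under('LANG' => 'C', 'LC_ALL' => 'en_US.UTF-8')
```

If the second line prints UTF-8 while the first prints US-ASCII (or ANSI_X3.4-1968 on glibc), then LC_ALL is taking priority over LANG, which would explain the warning if the Mac sets LC_ALL or LC_CTYPE somewhere in the login environment.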
* I'm not sure this is correct:
# 5. If one object is a String which contains only 7-bit ASCII characters
#    (ascii_only?), then the objects are compatible and the result has the
#    encoding of the other object.
Thank you, fixed.
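For anyone following along, the rule can be checked directly in irb; this assumes MRI 1.9+ behaviour, where a String containing only 7-bit characters is compatible with any ASCII-compatible encoding:

```ruby
ascii  = "hello"                                 # 7-bit only
latin1 = "caf\xE9".force_encoding("ISO-8859-1")  # 8-bit ISO-8859-1 data

p ascii.ascii_only?                    # => true
p Encoding.compatible?(ascii, latin1)  # => #<Encoding:ISO-8859-1>
p (ascii + latin1).encoding            # => #<Encoding:ISO-8859-1>
```

The concatenation result takes the encoding of the 8-bit operand, exactly as the (corrected) rule states.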
* I don't believe this is accurate:
# Normally, writing a string to a file ignores the encoding property.
I think we crossed over on that one. I spotted the error after
re-reading your articles and posted a correction - I think it's right
now.
* You say that m17n's complexity can be avoided if we just used UTF-8
everywhere and transcoded incoming and outgoing data. I agree. If we
do that in Ruby 1.9 though, transcode all data as it comes in and just
work with UTF-8 internally, doesn't all the complexity of m17n go
away? Compatible encodings, the comparison order of differing
encodings, and the like will all be non-issues.
Yes, for scripts that process text. And in practice, this is what most
people processing text will find: their source is in their preferred
encoding, their external files are in their preferred encoding, and
everything "just works" - pretty much in the way that ruby 1.8 did with
$KCODE.
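A sketch of that convention, using a temp file in place of real input (the file and its contents are made up for illustration): the IO open mode can do the boundary transcoding, so everything inside the program is already UTF-8.

```ruby
require 'tempfile'

utf8_text = nil
Tempfile.create('latin1') do |f|
  f.binmode
  f.write("caf\xE9\n".b)   # "café" as raw ISO-8859-1 bytes
  f.flush

  # "r:ISO-8859-1:UTF-8" = external encoding ISO-8859-1,
  # transcoded to UTF-8 as it is read in
  utf8_text = File.open(f.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end

p utf8_text.encoding  # => #<Encoding:UTF-8>
p utf8_text           # => "café\n"
```

After the read, the encoding question is settled once, at the boundary, and never comes up again inside the script.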
I have two key problems.
1. Working with binary. I can force the encoding on my own source files,
and I can force the encoding on any files that I open, but I still have
to interact with other libraries which return strings. If I build a
string by concatenating strings taken from elsewhere, I have to force
the encodings. If I forget, it may work sometimes (if those strings are
7-bit), but will fail if they are 8-bit.
Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.
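The failure mode described above is easy to demonstrate: concatenating binary data with a 7-bit string works, but the same code raises as soon as the other operand contains 8-bit characters.

```ruby
binary    = "\x00\xFF".b   # ASCII-8BIT, e.g. returned from a socket read
seven_bit = "header: "     # UTF-8 source encoding, but ascii_only?
utf8      = "héllo"        # genuinely 8-bit UTF-8

# 7-bit + binary is compatible; the result silently becomes ASCII-8BIT
p (seven_bit + binary).encoding   # => #<Encoding:ASCII-8BIT>

# 8-bit + binary is not compatible
begin
  utf8 + binary
rescue Encoding::CompatibilityError => e
  puts e.message
end
```

So the code appears to work in testing with ASCII data and only blows up later, which is exactly the trap.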
2. Working with other people's libraries.
Take REXML as an example. Suppose I decide I want to do this:
doc = REXML::Document.new(src)
Under 1.8, I could do this without worrying. But under 1.9, a whole host
of questions tumble out.
- will REXML require me to have set the src to the correct encoding?
- in order to parse it, will it reset the encoding of my 'src' object?
What will it do if 'src' is frozen? Will it dup the string?
XML documents carry their encoding within them. There's the encoding
declaration in the XML prolog, and the BOM; failing those, the document
is UTF-8 by definition, because if it were in a different encoding, then
it *must* declare it:
http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding
So I reckon REXML should ignore the encoding of src. Even if it were
tagged as (say) ISO-8859-1 because that's the locale encoding, or
ASCII-8BIT because it came from a socket, it should be treated as UTF-8
unless declared otherwise. And then if I access the node using #text,
would I get something tagged as UTF-8, or something else?
The only way to be sure is to try it and see (and a quick test suggests
that it does work in the way I described).
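The quick test might look something like this: hand REXML a source tagged ASCII-8BIT, as if it had just come off a socket, and see what #text gives back (the document itself declares UTF-8).

```ruby
require 'rexml/document'

# Source string tagged as binary, but the XML declaration says UTF-8
src = %(<?xml version="1.0" encoding="UTF-8"?><msg>caf\u00E9</msg>).b
p src.encoding   # => #<Encoding:ASCII-8BIT>

doc  = REXML::Document.new(src)
text = doc.root.text
p text            # "café", if REXML honours the declaration
p text.encoding   # expected: #<Encoding:UTF-8>
```

If #text comes back tagged UTF-8 regardless of how src was tagged, REXML is behaving as argued above.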
But this process has to be repeated for every library you might use.