Where is it documented please?
I'm not sure it's officially documented yet.
Ruby does throw an error in this scenario though:
$ ruby_dev
# encoding: UTF-16BE
ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)
and:
$ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' | ruby_dev
-:1: invalid multibyte char (UTF-8)
I believe this is the relevant code from Ruby's parser:
static void
parser_set_encode(struct parser_params *parser, const char *name)
{
    int idx = rb_enc_find_index(name);
    rb_encoding *enc;
    if (idx < 0) {
        rb_raise(rb_eArgError, "unknown encoding name: %s", name);
    }
    enc = rb_enc_from_index(idx);
    if (!rb_enc_asciicompat(enc)) {
        rb_raise(rb_eArgError, "%s is not ASCII compatible", rb_enc_name(enc));
    }
    parser->enc = enc;
}
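The same two checks can be sketched at the Ruby level. This is only an illustration, not the parser's code: check_source_encoding is a hypothetical helper name, while Encoding.find and Encoding#ascii_compatible? are standard core methods.

```ruby
# Hypothetical helper mirroring the two checks in parser_set_encode.
def check_source_encoding(name)
  # Encoding.find raises ArgumentError for an unknown encoding name.
  enc = Encoding.find(name)
  # Reject encodings that are not supersets of ASCII, such as UTF-16BE.
  unless enc.ascii_compatible?
    raise ArgumentError, "#{enc.name} is not ASCII compatible"
  end
  enc
end

check_source_encoding("UTF-8") # returns Encoding::UTF_8

begin
  check_source_encoding("UTF-16BE")
rescue ArgumentError => e
  puts e.message
end
```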
I need to say that XML parsers can handle such cases, i.e. when the XML
header is in different encoding than the rest of document.
I doubt we can say that universally.
Also, what you said isn't very accurate. For example, "in different
encoding than the rest of document" is not a possible occurrence
according to the XML 1.1 specification
(http://www.w3.org/TR/2006/REC-xml11-20060816/), which states:
"It is a fatal error when an XML processor encounters an entity with
an encoding that it is unable to process. It is a fatal error if an
XML entity is determined (via default, encoding declaration, or
higher-level protocol) to be in a certain encoding but contains byte
sequences that are not legal in that encoding."
All XML parsers are required to assume UTF-8 unless told otherwise and
to be able to recognize UTF-16 by a required BOM. Beyond that, they
are not required to recognize any other encodings, though they may of
course. Their encoding declaration can be expressed in ASCII and,
since they assume UTF-8 by default, this is similar to what Ruby
does. It allows a switch to an ASCII-compatible encoding.
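As a rough sketch of that minimum rule (UTF-8 unless a BOM says otherwise), the detection step might look like the following. sniff_encoding is a hypothetical helper written for this message, not part of any XML library, and it ignores UTF-32 and encoding declarations entirely.

```ruby
# Hypothetical helper: guess a document's encoding from its leading bytes.
def sniff_encoding(bytes)
  boms = {
    "\xEF\xBB\xBF".b => Encoding::UTF_8,
    "\xFE\xFF".b     => Encoding::UTF_16BE,
    "\xFF\xFE".b     => Encoding::UTF_16LE,
  }
  head = bytes.b # compare as raw bytes, whatever the string's encoding tag
  boms.each do |bom, enc|
    return enc if head.start_with?(bom)
  end
  Encoding::UTF_8 # no BOM: fall back to the default
end

sniff_encoding("<?xml version=\"1.1\"?>")        # no BOM, so UTF-8
sniff_encoding("\uFEFF<doc/>".encode("UTF-16BE")) # BOM FE FF, so UTF-16BE
```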
XML processors may do more. For example, they can accept a different
encoding from an external source to support things like HTTP headers
and MIME types. Ruby doesn't really have access to such sources at
execution time, so that option doesn't apply to the case we are
discussing. However, XML processors may also recognize other BOMs
and Ruby could do this.
Maybe this technique could be used for reading UTF-16 encoded files,
if needed?
Yes, Ruby could recognize BOMs for non-ASCII compatible encodings to
support them. A BOM would be required in this case though, just as it
is in an XML processor that doesn't have external information.
Ruby doesn't currently do this, as near as I can tell.
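To make the idea concrete, here is a sketch of what a loader could do before handing such a file to the parser: check for the UTF-16BE BOM, transcode to UTF-8, and drop the BOM. Again, this is not something Ruby's parser does, and decode_utf16be_source is a hypothetical name used only for illustration.

```ruby
# Hypothetical pre-processing step; not what Ruby's parser does.
# Returns the source transcoded to UTF-8, or nil if there is no UTF-16BE BOM.
def decode_utf16be_source(raw)
  bytes = raw.b # work on a binary copy so the caller's string is untouched
  return nil unless bytes.start_with?("\xFE\xFF".b)
  bytes.force_encoding(Encoding::UTF_16BE) # tag the bytes as UTF-16BE
       .encode(Encoding::UTF_8)            # transcode to an ASCII-compatible encoding
       .delete_prefix("\uFEFF")            # drop the BOM character itself
end

raw = "\uFEFFputs 'hello'\n".encode("UTF-16BE")
source = decode_utf16be_source(raw)
```

Without the BOM the helper has nothing to go on and returns nil, which is the same reason the BOM would have to be required.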
Note that this would not give what you proposed in your initial
message: multiple encodings in the same file. Ruby doesn't support
that and isn't ever likely to. An XML processor that supports such
things is in violation of its specification as I understand it.
Besides, not many text editors that I'm aware of make it super easy to
edit in multiple encodings.
James Edward Gray II