Ruby 1.9: mixed encoding in file

Vít Ondruch

Hello

I wonder if it is possible to enforce the encoding of a string in Ruby 1.9.
Let's say I have the following example:

C:\enc>echo p 'test'.encoding > encoding.rb
C:\enc>ruby encoding.rb
#<Encoding:US-ASCII>

That's fine. But what if I'd like to have ASCII, UTF-8, and strings
with other encodings in a single file, i.e.

C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
C:\enc>ruby encoding.rb
encoding.rb:1: invalid multibyte char (US-ASCII)

I know that for this particular case I could use a directive at the top
of the file, but I would like to see something in the following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

It means: read the content between the quotes as binary and interpret it
according to the specified encoding.

Vit
 
James Gray

Hello
Hello.

I wonder if it is possible to enforce the encoding of a string in Ruby 1.9.
Let's say I have the following example:

C:\enc>echo p 'test'.encoding > encoding.rb
C:\enc>ruby encoding.rb
#<Encoding:US-ASCII>

That's fine. But what if I'd like to have ASCII, UTF-8, and strings
with other encodings in a single file, i.e.

C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
C:\enc>ruby encoding.rb
encoding.rb:1: invalid multibyte char (US-ASCII)

I know that for this particular case I could use a directive at the top
of the file, but I would like to see something in the following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

It means read the content in between quotes binary and interpret it
according to specified encoding.

The problem with an idea like this is that before your String is ever
created, the code to create it must be read (correctly) by Ruby's
parser and formed into a proper String literal. That would be
impossible to do if String literals could be in any random Encoding.

You have a couple of options though:

* Just set an Encoding like UTF-8 for the source code, enter
everything in UTF-8, and transcode it into the needed Encoding. This
would make your example something like:

# encoding: UTF-8
cp852 = "zufällige_žluťoučký".encode("CP852")  # literal in UTF-8

* Have one or more data files the program reads the needed String
objects from. Those files can be in any Encoding you need and you can
specify it to IO operations, so your String objects are returned with
that Encoding.
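The second option can be sketched like this (a minimal example; the temporary file and the CP852 sample text are assumptions for illustration):

```ruby
require "tempfile"

# Write some CP852 bytes to a data file, then read them back while
# telling IO the external encoding, so the String arrives pre-tagged.
sample = "zuf\u00E4llige".encode("CP852")  # CP852 bytes for "zufällige"

str = nil
Tempfile.create("cp852_demo") do |f|
  f.binmode
  f.write(sample)          # raw CP852 bytes on disk
  f.rewind
  f.set_encoding("CP852")  # declare the external encoding to IO
  str = f.read             # String is returned tagged with CP852
end

p str.encoding
```

No transcoding step is needed here; the data file itself carries the desired bytes and IO merely labels them with the right Encoding.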

I hope that helps.

James Edward Gray II
 
Vít Ondruch

James said:
The problem with an idea like this is that before your String is ever
created the code to create it must be read (correctly) by Ruby's
parser and formed into a proper String literal. That would be
impossible to do if String literals could be in any random Encoding.

Yes, I understand that you have to parse the file. However, if I am
right, you still have to read the file as binary when you are looking
for an encoding directive at the top of the file. So from my point of
view, it shouldn't be a big problem to read until the first quote,
assuming the file is stored in the encoding declared at the top of the
file, then read whatever is between the quotes as binary and decide
later how to interpret that binary data, using the encoding given in
the second parameter of the string constructor.
You have a couple of options though:

* Just set an Encoding like UTF-8 for the source code, enter
everything in UTF-8, and transcode it into the needed Encoding. This
would make your example something like:

# encoding: UTF-8
cp852 = "zufällige_žluťoučký".encode("CP852")  # literal in UTF-8

* Have one or more data files the program reads needed String objects
from. Those files can be in any Encoding you need and you can specify
it to IO operations, so your String objects are returned with that
Encoding.

Both your suggestions are valid, of course, but I consider them far
from ideal solutions. They bring far more complexity than desired.
I hope that helps.

James Edward Gray II

Of course, my idea could be considered naive, and there might be many
technical issues with the parser, etc. which prevent the
implementation. Nevertheless, it would be a nice feature.

Thank you for your suggestions anyway.

Vit
 
James Gray

Yes, I understand that you have to parse the file. However, if I am
right, you still have to read the file as binary when you are looking
for an encoding directive at the top of the file.

You don't really have to:

$ cat source_encoding.rb
# encoding: UTF-8

output = ""
open(__FILE__, "r:US-ASCII") do |source|
  first_line = source.gets
  if first_line =~ /coding:\s*(\S+)/
    source.set_encoding($1)
  else
    output << first_line
  end
  output << source.read
end
p [output.encoding, output[0...20] + "…"]
$ ruby_dev source_encoding.rb
[#<Encoding:UTF-8>, "\noutput = \"\"\nopen(__…"]

James Edward Gray II
 
Vít Ondruch

James said:
On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

You don't really have to:

It is disturbing that this approach will fail as soon as the file is
UTF-16 encoded or has a BOM for UTF-8, etc.

Vit
 
James Gray

It is disturbing that this approach will fail as soon as the file is
UTF-16 encoded or has a BOM for UTF-8, etc.

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves. That eliminates any issues
with encodings like UTF-16. This makes perfect sense, as there's no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or the second,
if a shebang line is included), then switch encodings again if needed
and finish processing.
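The byte-sniffing step could look something like this (a hedged sketch of the idea, not what Ruby's parser actually does; the BOM table is limited to three common marks):

```ruby
# Map a few well-known BOMs (as raw bytes) to encoding names.
BOMS = {
  "\xEF\xBB\xBF".b => "UTF-8",
  "\xFE\xFF".b     => "UTF-16BE",
  "\xFF\xFE".b     => "UTF-16LE",
}

# Inspect the leading bytes of a file read in ASCII-8BIT and return
# [encoding_name, bom_length], or [nil, 0] when no BOM is present.
def sniff_bom(leading_bytes)
  BOMS.each do |bom, name|
    return [name, bom.bytesize] if leading_bytes.start_with?(bom)
  end
  [nil, 0]
end
```

After stripping a detected BOM, the reader could proceed in US-ASCII to look for a magic comment, much as in the earlier example.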

James Edward Gray II
 
Vít Ondruch

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves.

Where is it documented please?
That eliminates any issues
with encodings like UTF-16. This makes perfect sense as there's no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

It needs to be said that XML parsers can handle such cases, i.e. when
the XML header is in a different encoding than the rest of the document.
A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or the second,
if a shebang line is included), then switch encodings again if needed
and finish processing.

Maybe this technique could be used for reading UTF-16 encoded files,
if needed? However, this is getting too far from my initial post :)
James Edward Gray II

Vit
 
James Gray

Where is it documented please?

I'm not sure it's officially documented yet.

Ruby does throw an error in this scenario though:

$ ruby_dev
# encoding: UTF-16BE
ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)

and:

$ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' | ruby_dev
-:1: invalid multibyte char (UTF-8)

I believe this is the relevant code from Ruby's parser:

static void
parser_set_encode(struct parser_params *parser, const char *name)
{
    int idx = rb_enc_find_index(name);
    rb_encoding *enc;

    if (idx < 0) {
        rb_raise(rb_eArgError, "unknown encoding name: %s", name);
    }
    enc = rb_enc_from_index(idx);
    if (!rb_enc_asciicompat(enc)) {
        rb_raise(rb_eArgError, "%s is not ASCII compatible",
                 rb_enc_name(enc));
    }
    parser->enc = enc;
}
It needs to be said that XML parsers can handle such cases, i.e. when
the XML header is in a different encoding than the rest of the document.

I doubt we can say that universally. :)

Also, what you said isn't very accurate. For example, "in a different
encoding than the rest of the document" is not a possible occurrence
according to the XML 1.1 specification
(http://www.w3.org/TR/2006/REC-xml11-20060816/), which states:

"It is a fatal error when an XML processor encounters an entity with
an encoding that it is unable to process. It is a fatal error if an
XML entity is determined (via default, encoding declaration, or
higher-level protocol) to be in a certain encoding but contains byte
sequences that are not legal in that encoding."

All XML parsers are required to assume UTF-8 unless told otherwise and
to be able to recognize UTF-16 by a required BOM. Beyond that, they
are not required to recognize any other encodings, though they may of
course. Their encoding declaration can be expressed in ASCII and,
since they assume UTF-8 by default, this is similar to what Ruby does.
It allows a switch to an ASCII-compatible encoding.

XML processors may do more. For example, they can accept a different
encoding from an external source to support things like HTTP headers
and MIME types. Ruby doesn't really have access to such sources at
execution time, so that option doesn't apply to the case we are
discussing. However, XML processors may also recognize other BOMs,
and Ruby could do this.
Maybe this technique could be used for reading UTF-16 encoded files,
if needed?

Yes, Ruby could recognize BOMs for non-ASCII compatible encodings to
support them. A BOM would be required in this case though, just as it
is in an XML processor that doesn't have external information.

Ruby doesn't currently do this, as near as I can tell.

Note that this would not give what you proposed in your initial
message: multiple encodings in the same file. Ruby doesn't support
that and isn't ever likely to. An XML processor that supports such
things is in violation of its specification, as I understand it.

Besides, not many text editors that I'm aware of make it super easy to
edit in multiple encodings. :)

James Edward Gray II
 
Caleb Clausen

file, but I would like to see something in the following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

You seem to be asking for the ability to have individual string
literals have encoding different from that of the program as a whole.
Why not this:

#encoding: ascii-8bit
'zufällige_žluťoučký'.force_encoding 'cp852'
'some utf8 data'.force_encoding 'utf-8'
'some sjis data'.force_encoding 'sjis'

I am far from an expert on encodings, but in my (admittedly minimalist
and perhaps inadequate) testing, this seems to basically work.

There are going to be holes in this; data in non-ASCII compatible
encodings in particular may give trouble. However, if the string data
does not contain the bytes 0x27 (ASCII ') or 0x5C (ASCII \), there
will be no problem. Whether this will work in particular
circumstances, given a known encoding and data to be represented in
it, is unknown in general, but it will surely very often be the case.
If it's the single quote character that causes the problem, you can
switch to a different delimiter using the %q[] quote syntax. In
extremis, a single-quoted here document may be called for:

<<-'end'
lotsa ' and \ here, but ruby don't care
end

This form of string has the advantage of having no special characters
at all, and you can choose the sequence of bytes that makes up the
string terminator to be anything you want. (but you do end up with an
extra (ascii) newline at the end...)
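Since %q[] was only mentioned in passing, here is what that form looks like (a tiny sketch):

```ruby
# %q[] behaves like a single-quoted string, but with [] as delimiters,
# so embedded ' characters need no escaping.
s = %q[it's got 'quotes' inside]
```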

Another challenge will be editing this file. There's no editor out
there that could actually display this kind of thing correctly; you'll
have to become proficient at editing it as binary, or at least find an
editor that can tolerate arbitrary binary chars in its ASCII.
 
Brian Candler

Vít Ondruch said:
I know that for this particular case I could use directive on top of the
file, but I would like to see something in following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

It's not pretty, but

str = "zuf\x84llige_\xA7lu\x9Cou\x9Fk\xEC".force_encoding("CP852")

will probably do the job.
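If looking up those escapes by hand gets tedious, a small helper (hypothetical, assuming the text starts out in UTF-8 and the remaining ASCII bytes need no escaping of their own) can generate such a literal:

```ruby
# Hypothetical helper: transcode UTF-8 text to CP852 and emit a Ruby
# expression with the non-ASCII bytes written as \xNN escapes.
# Assumes the ASCII bytes are printable and contain no " or \.
def cp852_literal(utf8_text)
  escaped = utf8_text.encode("CP852").bytes.map { |b|
    b < 0x80 ? b.chr : format('\x%02X', b)
  }.join
  %Q{"#{escaped}".force_encoding("CP852")}
end

puts cp852_literal("zuf\u00E4llige")
```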
 
Vít Ondruch

Caleb said:
You seem to be asking for the ability to have individual string
literals have encoding different from that of the program as a whole.
Why not this:

#encoding: ascii-8bit
'zufällige_žluťoučký'.force_encoding 'cp852'
'some utf8 data'.force_encoding 'utf-8'
'some sjis data'.force_encoding 'sjis'

Hmmm, that is a good idea!!!

Which leads me to the question: why is the default encoding US-ASCII
instead of ASCII-8BIT?
Another challenge will be editing this file. There's no editor out
there that could actually display this kind of thing correctly; you'll
have to become proficient at editing it as binary, or at least find an
editor that can tolerate arbitrary binary chars in its ASCII.

It's almost the same challenge if you want to edit a single file in a
different encoding than your system encoding ... so it's not relevant
... on the contrary, it could be even easier, because in my case I
don't care much about the content, since I need multiple encodings for
testing.
 
