Encoding problems .. ruby 1.9.2

Bhay Zone

I am pretty new to ruby and am trying to read text data coming from a
backend which can only be queried using proprietary Command Line
Interface commands.

The problem is that this text data contains non-ascii characters. I
don't know what these characters are, nor do I know the encoding.

Earlier, when we were using ruby 1.8.7 we had some code that handled
these characters pretty well. Now after switching to ruby 1.9.2, the
same code breaks with encoding errors like "invalid multibyte sequence"
in gsub.

Here is the code we were using to replace the non-ascii characters,
which is breaking now. It breaks at the first line.

content.gsub!( "\221", '')
content.gsub!( "\222", '')
content.gsub!( "\223", '')
content.gsub!( "\224", '')
content.gsub!( "\246", '')
content.gsub!( "\247", '')
content.gsub!( "\237", '')
content.gsub!( "\377", '')
content.gsub!( "\226", '')
content.gsub!( "\227", '')
content.gsub!( "\\000", "?")
content.gsub!( "\\001", "?")
content.gsub!( "\FB01", "")
content.gsub!(/[\x80-\xFF]/,'')
content.gsub!(/[\x00-\x08]/,'')
content.gsub!(/[\x0B-\x0C]/,'')
content.gsub!(/[\x0E-\x1F]/,'')

I just cannot figure how to fix this problem and any help would be
greatly appreciated.
 
Caleb Clausen

In 1.9, every string (and regular expression) has an encoding attached
to it. If there are any byte sequences in your string that don't match
the encoding, it causes errors. 1.8 was much more permissive about its
strings, allowing arbitrary binary data in any string, which is why it
worked better for you. You can get back the 1.8 behavior under 1.9 by
setting the encoding of your string objects to 'binary'.

My first suggestion would be to set the encoding of the string in the
variable content to binary before doing any of the gsub!s:
content.force_encoding('binary')
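
To make that concrete, here is a minimal sketch. The sample bytes are made up for illustration and are not real output from the backend; the /n flag on the regexp marks the pattern itself as binary.

```ruby
# Build a sample string with two stray high bytes (0x91, 0x92), then label it
# as UTF-8, the way ruby 1.9 might after guessing from the environment.
content = [0x91, 0x68, 0x69, 0x92].pack("C*").force_encoding("UTF-8")

valid_as_utf8 = content.valid_encoding?    # false: lone 0x91/0x92 are not valid UTF-8

# Relabel the same bytes as binary. No bytes are changed or removed.
content.force_encoding("BINARY")
valid_as_binary = content.valid_encoding?  # true: any byte is valid binary

# With a binary string and a binary (/n) pattern, gsub! works byte by byte.
content.gsub!(/[\x80-\xFF]/n, "")
```

After the last line, content holds just the two ASCII bytes "hi".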

However, a better way would be to set the encoding of the IO object
the strings are read from. That way you don't need to force_encoding
each string as it comes in.
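
A rough sketch of the IO-level approach follows. IO.pipe is only a stand-in here for whatever stream the data really arrives on, and the bytes are invented for the example.

```ruby
# IO.pipe stands in for the real source of the data (an assumption made
# purely so this sketch is self-contained).
reader, writer = IO.pipe

# Tell ruby the incoming bytes are binary, so it does not guess an encoding.
reader.set_encoding("BINARY")

writer.write([0x91, 0x6F, 0x6B, 0x92].pack("C*"))
writer.close

chunk = reader.read
reader.close

# chunk is already tagged as binary (ASCII-8BIT); no force_encoding needed.
chunk_encoding = chunk.encoding
```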

Even better is to figure out what encoding this external tool is
using and set the IO's encoding to that. Then perhaps a lot of this
hacky string mangling could go away.
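
As an illustration only: bytes such as 0x91 to 0x94, 0x96 and 0x97 are curly quotes and dashes in Windows-1252, so that encoding is one plausible guess for this data. It is a guess, not something confirmed anywhere in this thread. Under that assumption, the bytes can be labelled and transcoded instead of deleted:

```ruby
# ASSUMPTION: the data is Windows-1252. This is a guess for illustration only.
raw = [0x91, 0x68, 0x69, 0x92, 0x20, 0x96, 0x20, 0x78].pack("C*")

# Label the bytes with their assumed encoding, then transcode to UTF-8.
# Bytes that are invalid or undefined in the source encoding become "?"
# instead of raising an exception.
text = raw.force_encoding("Windows-1252").encode(
  "UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"
)
```

If the guess is right, the curly quotes and dash survive as proper UTF-8 characters rather than being stripped.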

But this is still only half the story. You also have to consider the
encoding of the strings and regexps which get passed as the first
argument to gsub. Those string (and regexp) literals default to the
same encoding as the source file they're contained in. If no explicit
encoding is declared for a source file, ruby 1.9 treats it as
US-ASCII; code passed with -e or typed into irb instead takes its
encoding from your environment (the LANG and LC_* env vars), which
often means utf-8.

You can declare a specific encoding explicitly by putting something
like this as the very first line in your source:
#encoding: binary
(or the second line if the first line is a shebang line).
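
Here is a small sketch of what the magic comment does to literals. The helper name and the global variable are made up for this example; it writes a throwaway script under each magic comment, loads it, and reports the encoding the literal received.

```ruby
require "tempfile"

# Hypothetical helper for this sketch: write a one-line script under the given
# magic comment, load it, and return the encoding its string literal received.
def literal_encoding_under(magic_comment)
  Tempfile.open(["encoding_demo", ".rb"]) do |file|
    file.write("#{magic_comment}\n$demo_literal_encoding = 'abc'.encoding\n")
    file.flush
    load file.path
  end
  $demo_literal_encoding
end

binary_case = literal_encoding_under("# encoding: binary")  # Encoding::BINARY
utf8_case   = literal_encoding_under("# encoding: utf-8")   # Encoding::UTF_8
```

Note that only the literal inside the loaded file is affected; strings that arrive from outside the program keep whatever encoding ruby assigned to them.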

I used the binary encoding in the example line above because that's
probably the one which will work best for you under the circumstances.
Declaring the source encoding to be binary is a bit hackish, but
probably the easiest way to get you where you want to go. If you
figure out what encoding your data is in, you're probably better off
declaring the source encoding to be the same thing, but there may be
more work involved there.

PS: there is some redundancy in the sequence of gsub!s you posted. The
first 10 (for "\221" thru "\227") are special cases of the 14th (for
/[\x80-\xFF]/) and can safely be deleted. Also, "\FB01" is the same
thing as "FB01" in both ruby 1.8 and 1.9 and probably not what you
wanted. (Maybe "\xFB\x01" is what you actually meant?)
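
For what it's worth, the deduplicated sequence might look roughly like this. The method name is made up, and the sketch deliberately leaves out the "\\000", "\\001" and "\FB01" lines, since it isn't clear what they were meant to match.

```ruby
# Hypothetical helper for this sketch; the name is not from the original code.
def strip_unwanted_bytes(content)
  # Work on a binary-tagged copy so the caller's string is left alone.
  cleaned = content.dup.force_encoding("BINARY")
  # One pass covers every byte from 0x80 to 0xFF, which includes all of the
  # individual "\221" .. "\377" substitutions from the original list.
  cleaned.gsub!(/[\x80-\xFF]/n, "")
  # Control characters other than tab (0x09), newline (0x0A) and CR (0x0D).
  cleaned.gsub!(/[\x00-\x08\x0B\x0C\x0E-\x1F]/n, "")
  cleaned
end

sample = [0x91, 0x41, 0x00, 0x09, 0x42, 0xFF, 0x0A, 0x1F].pack("C*")
cleaned_sample = strip_unwanted_bytes(sample)
```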

HTH
 
Brian Candler

Bhay said:
I am pretty new to ruby and am trying to read text data coming from a
backend which can only be queried using proprietary Command Line
Interface commands.

The problem is that this text data contains non-ascii characters...I
don't know what these characters are .. and nor do I know the encoding.

How are you interfacing with this interface - a TCP socket? IO.popen?
Backticks? Something else? If you show the code which opens the
connection, we can show how to fix it.

TCP sockets default to "ASCII-8BIT" encoding, but for other methods,
unless you tell ruby what encoding to use, it will guess based on
environment variables on your PC. That is, the same program may work
fine on one PC but fail on another.

To avoid these problems, there are magic incantations you can add to
force ruby not to guess. e.g.

IO.popen: add "b" to the mode string

Backticks or %x: res = `foo`; res.force_encoding("ASCII-8BIT")

Or try running ruby with -Kn flag.
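
A sketch of the first two options, using a child ruby process as a stand-in for the real command (an assumption so the example can run anywhere ruby is installed):

```ruby
require "rbconfig"
require "shellwords"

# Stand-in for the real CLI: a child ruby that prints four raw bytes.
child = [
  RbConfig.ruby, "-e",
  "$stdout.binmode; $stdout.write([0x91, 0x68, 0x69, 0x92].pack('C*'))"
]

# IO.popen with "b" in the mode string: the result is tagged ASCII-8BIT.
via_popen = IO.popen(child, "rb") { |io| io.read }

# Backticks: ruby guesses an encoding, so relabel the result afterwards.
via_backticks = `#{Shellwords.join(child)}`
via_backticks.force_encoding("ASCII-8BIT")
```

Either way the string ends up tagged as binary and holds the same four bytes.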

Bhay said:
I just cannot figure how to fix this problem and any help would be
greatly appreciated.

It's probably possible to fix your code, as above. However, sticking
with ruby 1.8.7 is also a reasonable solution if you don't want to have
to deal with this sort of nonsense.

I had a go at reverse-engineering the string encoding behaviour of ruby
1.9. I gave up after documenting about 200 behaviours:
http://github.com/candlerb/string19/blob/master/string19.rb

I'm sticking with 1.8, because 1.9 makes my brain hurt.
 
Bhay Zone

Caleb, Brian - Thank you for your replies.

The source of this data is a bug tracking tool known as GNATS. Now this
tool also comes with a client which provides a command line util known
as query-pr to query GNATS. The output of query-pr is delimited text. If
you run query-pr from the linux shell, it prints the output on the
screen.

I invoke query-pr from my ruby program as follows (note the opening and
closing backtick (`) characters).

result=`query-pr --expr 'Status="closed"'`
# parse the result and take appropriate action.

I am not very sure, but my guess is that the GNATS client uses TCP
sockets to interface with the GNATS DB.

Thanks for pointing out the redundancy, I'll fix that in my code.

Right now I have "# coding: utf-8" as the first line in the ruby file. I
found that while trying to figure out this problem and hoped it would
make magic ... but well ... :-(

I'll also try out the "# coding: binary" to see if that works for my
case.

I'm not sure if going back to ruby 1.8.7 is an option .. will keep that
as a last option.
 
Brian Candler

Bhay said:
I invoke query-pr from my ruby program as follows (note the opening and
closing backtick (`) characters).

result=`query-pr --expr 'Status="closed"'`
# parse the result and take appropriate action.

That's backticks. Follow that line with:

result.force_encoding("ASCII-8BIT")

when running with ruby 1.9, before you start doing your substitutions.

Bhay said:
I am not very sure, but my guess is that the GNATS client uses TCP
sockets to interface with the GNATS DB.

Maybe, but that's irrelevant here. Ruby is reading the output of
query-pr, as a string, and has decided to give it some arbitrary guessed
encoding.

Bhay said:
Right now I have "# coding: utf-8" as the first line in the ruby file. I
found that while trying to figure out this problem and hoped it would
make magic ... but well ... :-(

I'll also try out the "# coding: binary" to see if that works for my
case.

It won't. It will only affect the encoding of string literals within
your code, not the encoding of the data you read from query-pr.
 
Brian Candler

Bhay said:
After result.force_encoding("ASCII-8BIT"), are the gsubs necessary?

Why do you do them in the ruby 1.8.7 version? If they served a purpose
there, then presumably they still serve a purpose.

All the force_encoding business is doing is preventing these lines from
crashing ruby 1.9. The bytes in the string from query-pr will still be
the same.
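
A quick way to convince yourself of that, using made-up sample bytes:

```ruby
# Sample string with two high bytes, initially tagged as UTF-8.
result = [0x91, 0x41, 0x92].pack("C*").force_encoding("UTF-8")
bytes_before = result.bytes.to_a

# Relabelling changes the encoding tag only; the bytes stay where they are.
result.force_encoding("ASCII-8BIT")
bytes_after = result.bytes.to_a

# The high bytes are still in the string, so the gsubs still have work to do.
still_has_high_bytes = !!(result =~ /[\x80-\xFF]/n)
```

Here bytes_before and bytes_after are identical, and still_has_high_bytes is true.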
 
