gsub: invalid byte sequence in US-ASCII

R.. Kumar · Jun 15, 2010

I downloaded the page http://www.ruby-forum.com/forum/4 using wget. Then I
cat the file and pipe it to gsub.

I get: -e:1:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)

wget -q -k -O index11.html http://www.ruby-forum.com/forum/4

cat index11.html | ruby -pe 'gsub(/href=a\/"/,"href=\"'${base}'")' > ofile

(The value of base is http://www.ruby-forum.com/)

So what must I do so that this command can run? It runs fine with another
site.
If I replace ruby with perl -pe 's|....|g', that works fine.

I actually run this in a loop with various URLs from cron.

Brian Candler · Jun 15, 2010

R.. Kumar said:
If i replace ruby with perl -pe 's|....|g' that works fine.

Replacing ruby 1.9.x with ruby 1.8.x is just as effective, and I would
recommend this for maintaining your sanity.

I can only guess that the external encoding picked up from your
platform's environment is US-ASCII (are you using cygwin by any chance?)

You probably need to set the external encoding to UTF-8 or BINARY for
your regexp not to crash. Try adding -Ku or -Kn to your ruby command
line.
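To make the failure concrete, here is a minimal sketch of what the regexp engine objects to and how retagging the bytes avoids it. The sample bytes and the href pattern are made up for illustration; they are not taken from the downloaded page or from the original command.

```ruby
# Bytes as they might arrive on stdin: UTF-8 text with one non-ASCII character.
# (Sample data and pattern are illustrative assumptions.)
raw = "caf\xC3\xA9 <a href=\"/forum/4\">".b

pattern     = /href="\//
replacement = 'href="http://www.ruby-forum.com/'

# Tagged as US-ASCII, which is the label input gets when the locale is C/POSIX.
ascii = raw.dup.force_encoding(Encoding::US_ASCII)

error = begin
  ascii.gsub(pattern, replacement)
  nil
rescue ArgumentError => e
  e.message
end

# Retagging the same bytes lets the substitution go through.
as_utf8   = raw.dup.force_encoding(Encoding::UTF_8).gsub(pattern, replacement)
as_binary = raw.gsub(pattern, replacement) # raw is already ASCII-8BIT (BINARY)

puts error
puts as_utf8
```

The UTF-8 variant keeps character semantics, while the BINARY variant treats the input as plain bytes; for an ASCII-only pattern like this one, both produce the same bytes.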

If you want to attempt to understand String encoding in ruby 1.9, then
good luck to you. I tried, documented what I found here:
http://github.com/candlerb/string19/blob/master/string19.rb
and gave up after about 200 rules. There is no official documentation.

Caleb Clausen · Jun 15, 2010

Handling this kind of thing right means tracking encodings right....
which means you'd have to extract the encoding from the http session
and then mark the input as that encoding in your ruby script... and
then deal with the inevitable incompatible encoding errors that would
crop up.
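As a rough sketch of that "track the encoding" route: take the charset from the HTTP Content-Type header, tag the raw bytes with it, and transcode to UTF-8 before substituting. The helper name, the header value, and the sample body are all assumptions for illustration; a real script would have to get the header from the HTTP response, which wget -O does not save alongside the page.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
# Falls back to BINARY when the header names no charset or an unknown one.
def encoding_from_content_type(content_type)
  name = content_type.to_s[/charset\s*=\s*["']?([^\s;"']+)/i, 1]
  return Encoding::BINARY if name.nil?

  Encoding.find(name)
rescue ArgumentError
  Encoding::BINARY
end

# Illustrative data: the byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
header = "text/html; charset=ISO-8859-1"
body   = "caf\xE9 <a href=\"/forum/4\">".b

source_encoding = encoding_from_content_type(header)

# Tag the bytes with their real encoding, then transcode to UTF-8.
text  = body.force_encoding(source_encoding).encode(Encoding::UTF_8)
fixed = text.gsub(/href="\//, 'href="http://www.ruby-forum.com/')

puts fixed
```

If the header is wrong or missing, the encode step is where the "inevitable" errors would surface, so a real script would need a fallback at that point.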

It sounds to me, tho, like in this case what you have is just some
hacky little scripts and it would be acceptable for them to be
imperfect. So, in that case, I suggest trying to set the encoding for
your source file(s) to BINARY. That's a hack, but it ought to be
effective.

Alternately, you could drop back to the 1.8 interpreter, like Brian
suggests, which more or less uses BINARY as the default source
encoding.

Bill Kelly · Jun 15, 2010

Caleb said:
Handling this kind of thing right means tracking encodings right....
which means you'd have to extract the encoding from the http session
and then mark the input as that encoding in your ruby script... and
then deal with the inevitable incompatible encoding errors that would
crop up.

It sounds to me, tho, like in this case what you have is just some
hacky little scripts and it would be acceptable for them to be
imperfect. So, in that case, I suggest trying to set the encoding for
your source file(s) to BINARY. That's a hack, but it ought to be
effective.

Additional info on the source, external, and internal encodings:

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

For the OP, I'd expect `ruby -EBINARY ...` or `ruby -EASCII-8BIT ...`
to work.
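For a script that reads the file itself instead of relying on -p and stdin, a comparable effect can be had per file by reading in binary mode. This is only a sketch of that idea, not the -E switch itself; the temp file stands in for the page saved by wget, and its contents and the pattern are made up.

```ruby
require "tempfile"

# Illustrative stand-in for the page saved by wget (contents are made up).
page = Tempfile.new("index11")
page.binmode
page.write("caf\xC3\xA9 <a href=\"/forum/4\">\n")
page.close

# File.binread returns ASCII-8BIT (BINARY) strings whatever the locale says,
# so an ASCII-only regexp can run over them without an encoding check failing.
rewritten = File.binread(page.path).each_line.map { |line|
  line.gsub(/href="\//, 'href="http://www.ruby-forum.com/')
}.join

page.unlink

puts rewritten.encoding
```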

Regards,

Bill

R.. Kumar · Jun 16, 2010

Brian said:
Replacing ruby 1.9.x with ruby 1.8.x is just as effective, and I would
recommend this for maintaining your sanity.

I can only guess that the external encoding picked up from your
platform's environment is US-ASCII (are you using cygwin by any chance?)

You probably need to set the external encoding to UTF-8 or BINARY for
your regexp not to crash. Try adding -Ku or -Kn to your ruby command
line.

If you want to attempt to understand String encoding in ruby 1.9, then
good luck to you. I tried, documented what I found here:
http://github.com/candlerb/string19/blob/master/string19.rb
and gave up after about 200 rules. There is no official documentation.

1. I moved to 1.9 long ago. I don't want to move back.

2. I am on OS X. I think I had _probably_ (?) solved this issue on my
previous laptop (PPC). Now I've migrated my user to a new machine
(Snow Leopard). All my settings should have moved. (I say this since I
had commented out the perl line.)

LC_ALL=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LANG=C

3. Thanks for the link, I will read it. But NO, I have already read up
enough a few months back, and do not have the energy to do it again :-(.

Thanks for the tip on -Ku / -Kn

R.. Kumar · Jun 16, 2010

Brian said:
I can only guess that the external encoding picked up from your
platform's environment is US-ASCII (are you using cygwin by any chance?)

You probably need to set the external encoding to UTF-8 or BINARY for
your regexp not to crash. Try adding -Ku or -Kn to your ruby command
line.

Ok, I've got it. The problem occurred when the program was run by cron.
My user setting is UTF-8 and it ran fine in the terminal. So now in the
program itself I have set LC_CTYPE and LC_ALL to en_US.UTF-8. Hopefully,
it should work fine now.
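For anyone hitting the same thing, here is a small sketch of why the terminal and cron behave differently. It follows the POSIX lookup order for the character-set locale (LC_ALL, then LC_CTYPE, then LANG). The helper names are made up, and the empty cron environment is an assumption; check what your cron actually passes.

```ruby
# Hypothetical helper: which locale value decides the character set, following
# the POSIX lookup order LC_ALL, then LC_CTYPE, then LANG. Empty values count as unset.
def effective_ctype(env)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  "C" # nothing set: the default C/POSIX locale
end

# Hypothetical helper: does the deciding locale value name UTF-8?
def utf8_locale?(env)
  effective_ctype(env).match?(/utf-?8/i)
end

# The settings quoted earlier in the thread: LC_ALL wins over LANG=C.
terminal_env = { "LC_ALL" => "en_US.UTF-8", "LC_CTYPE" => "en_US.UTF-8", "LANG" => "C" }

# Assumption: cron passes none of these variables to the job.
cron_env = {}

puts "terminal: #{effective_ctype(terminal_env)}"
puts "cron:     #{effective_ctype(cron_env)}"
puts "this process: #{effective_ctype(ENV)} / #{Encoding.default_external}"
```

Logging the last line from inside the cron job is a quick way to confirm which external encoding ruby actually picked up there.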

Thanks.
