Ruby 1.9.2 UTF-8 Encoding issues while reading/writing files

Atoli Atoli · Nov 18, 2010

Hello

My Ruby 1.9.2 does some strange things when handling encodings,
especially when reading and writing text files.
I have a file (attached) containing broken UTF-8: E5 AD 97 E6 99 2E
The offending sequence is E6 99 (E6 starts a three-byte character, but
only one continuation byte follows before the 2E).

And here's the example I'm using to try to figure out how this
Encoding stuff works:
#--------------------------------------
data = File.open("broken.txt", "r:UTF-8") { |f| f.read }
puts data.valid_encoding?

utf_a = data.encode("UTF-8", invalid: :replace, undef: :replace, replace: "_")
puts utf_a.valid_encoding?

utf_b = data.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "_")
puts utf_b.valid_encoding?
puts((utf_a == utf_b) && (data == utf_b))

File.open("valid.txt", "w:UTF-8") { |f| f.write(utf_a) }
#--------------------------------------

The output is:
false
false
true
true

Basically I'm trying to replace the broken sequences with "_", but the
encode method doesn't seem to do any replacements, maybe because the
forced encoding is already set to UTF-8?

I've read James Edward Gray's posts on string encodings in Ruby 1.9 and
also candlerb's doc, but didn't find anything.

This can't be so complicated, right? I'm sure I'm missing something.

Thank you.

Atoli.

Attachments:
http://www.ruby-forum.com/attachment/5415/broken.txt

brabuhr · Nov 18, 2010

> Basically I'm trying to replace the broken sequences with "_", but the
> encode method doesn't seem to do any replacements, maybe because the
> forced encoding is already set to UTF-8?

For the case of fixing broken files, I would probably use iconv from the
shell:

$ cat broken.txt
字?.
$ iconv -f UTF8 -t UTF8 --byte-subst=_ broken.txt
字__.

(I don't know if Ruby's Iconv module supports the subst options.)

For short strings, this seems to work:

irb(main):001:0> s = "\xE5\xAD\x97\xE6\x99\x2E"
=> "字\xE6\x99."
irb(main):002:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s.valid_encoding?
=> false
irb(main):004:0> t = s.chars.map{|c| c.valid_encoding? ? c : '_'}.join
=> "字__."
irb(main):005:0> t.valid_encoding?
=> true
irb(main):006:0> t.encoding
=> #<Encoding:UTF-8>
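
The irb session above can be wrapped in a small helper so it is easy to reuse. This is only a sketch: the method name replace_invalid_chars and its replacement parameter are made up for illustration, and it assumes the source file is UTF-8 so that the string literal is tagged as UTF-8, as in the session.

```ruby
# Hypothetical helper (the name is not from any library): walk the string
# character by character and swap every character that is not valid in the
# string's encoding for the replacement.
def replace_invalid_chars(str, replacement = "_")
  str.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
end

# Same bytes as in the broken file: a valid three-byte character,
# then a truncated sequence (E6 99), then an ASCII full stop.
broken = "\xE5\xAD\x97\xE6\x99\x2E"
fixed  = replace_invalid_chars(broken)

puts broken.valid_encoding?  # the input is reported as invalid
puts fixed.valid_encoding?   # the cleaned copy is valid
```

Each stray byte becomes its own replacement here, which is why the session shows two underscores. On Ruby 2.1 and later, String#scrub is built in for this kind of cleanup, but it is not available on 1.9.2.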

Atoli Atoli · Nov 18, 2010

Thanks for the tip.

For now, making Ruby "think" the encoding is valid seems to work (it
doesn't break regular expressions, at least).
So I just call
encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "_")
each time I read my files.
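
For the read-then-clean step, here is a rough sketch that does not depend on how encode treats a UTF-8 to UTF-8 conversion. It reads the raw bytes, tags them as UTF-8, and replaces invalid characters one by one. The method name read_utf8_replacing_invalid is hypothetical, and the temporary file only stands in for broken.txt so the example is self-contained.

```ruby
require "tempfile"

# Hypothetical helper: read a file as raw bytes, label the bytes as UTF-8,
# and replace every character that is not valid UTF-8.
def read_utf8_replacing_invalid(path, replacement = "_")
  text = File.binread(path).force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  text.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
end

# Stand-in for broken.txt: the same six bytes as in the original post.
tmp = Tempfile.new("broken")
tmp.binmode
tmp.write([0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E].pack("C*"))
tmp.close

cleaned = read_utf8_replacing_invalid(tmp.path)
tmp.unlink

puts cleaned.valid_encoding?
```

The cleaned string can then be written back with File.open("valid.txt", "w:UTF-8") as in the original example.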
