Atoli Atoli
Hello
My Ruby 1.9.2 does some strange things when manipulating encodings,
especially when reading and writing text files.
I have a file (attached) with broken UTF-8 characters: E5 AD 97 E6 99 2E
The offending sequence is E6 99
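For anyone without the attachment, the same six bytes can be built in memory. This is only a sketch of the input described above: the variable names are illustrative, and it assumes the file contains exactly those bytes and nothing else. E6 opens a three-byte UTF-8 sequence, so E6 99 followed by 2E (".") is one byte short.

```ruby
# Rebuild the bytes listed in the post instead of reading broken.txt.
bytes = [0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E]
data = bytes.pack("C*").force_encoding("UTF-8")
puts data.valid_encoding?    # false: E6 99 is a truncated three-byte sequence

# The first three bytes (E5 AD 97) form one complete character on their own.
prefix = bytes[0, 3].pack("C*").force_encoding("UTF-8")
puts prefix.valid_encoding?  # true
```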
And here's the example I'm using to try to figure out how this
Encoding stuff works:
#--------------------------------------
data = File.open("broken.txt", "r:UTF-8") { |f| f.read }
puts data.valid_encoding?
utf_a = data.encode("UTF-8", invalid: :replace, undef: :replace, replace: "_")
puts utf_a.valid_encoding?
utf_b = data.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "_")
puts utf_b.valid_encoding?
puts((utf_a == utf_b) && (data == utf_b))
File.open("valid.txt", "w:UTF-8") { |f| f.write(utf_a) }
#--------------------------------------
The output is:
false
false
true
true
Basically I'm trying to replace the broken sequences with "_", but the
encode method doesn't seem to do any replacements, maybe because the
string's encoding is already set to UTF-8?
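If that guess is right (a UTF-8 to UTF-8 conversion being skipped as a no-op), one possible workaround is to transcode through a different encoding and back, so the converter actually has to inspect every byte. The sketch below is untested on 1.9.2 and rests on that assumption. `replace_invalid_utf8` is a hypothetical helper name, not part of Ruby, and the `String#scrub` branch only applies to Ruby 2.1 and later, where that method exists.

```ruby
# Hypothetical helper (not a Ruby built-in): returns a copy of a UTF-8 tagged
# string with invalid byte sequences replaced by `replacement`.
def replace_invalid_utf8(str, replacement = "_")
  if str.respond_to?(:scrub)
    # Ruby 2.1+ ships String#scrub for exactly this job.
    str.scrub(replacement)
  else
    # Older Rubies: go through UTF-16LE and back so a real conversion happens.
    str.encode("UTF-16LE", invalid: :replace, undef: :replace, replace: replacement)
       .encode("UTF-8")
  end
end

# Same bytes as in broken.txt, built in memory.
data = [0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E].pack("C*").force_encoding("UTF-8")
fixed = replace_invalid_utf8(data)
puts fixed.valid_encoding?
puts fixed
```

The round trip costs an extra conversion, but it does not depend on how the same-encoding case is treated.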
I've read James Edward's post concerning string encodings in Ruby 1.9 and
also candlerb's doc, but didn't find anything.
This can't be so complicated, right? Surely I'm missing something.
Thank you.
Atoli.
Attachments:
http://www.ruby-forum.com/attachment/5415/broken.txt