Reading Files: how do I specify the encoding?

Claus Hausberger · May 14, 2007

Hello

I have a lot of XML and Java files which contain German umlauts and other
non-ASCII characters.

I want to read the files and convert them to UTF-8 using a Ruby script.

I convert the strings with this code:

def to_utf8(str)
  str.unpack('U*').map do |c|
    if c < 0x80
      c.chr
    else
      '( u%04X )' % c
    end
  end.join
end

(taken from "The Ruby Way" by Hal Fulton).

Sometimes it works, sometimes I get this error:
"malformed UTF-8 character"

I thought this might happen because the file is encoded in ISO-8859-1
(it was written with Eclipse set to ISO-8859-1 for text encoding).

How can I read a file with Ruby and specify that it is read with
ISO-8859-1 encoding (similar to Java's BufferedReader, where I can
specify the encoding)?

Any help welcome. Best wishes

Claus

Alex Young · May 14, 2007

Claus said:
Hello

I have a lot of XML and Java files which contain German umlauts and other
non-ASCII characters.

I want to read the files and convert them to UTF-8 using a Ruby script.

I convert the strings with this code:

def to_utf8(str)
  str.unpack('U*').map do |c|

I'd be surprised if this was right - you're telling it that you're
expecting the string to be UTF-8 already with that unpack format.
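
To illustrate the point, here is a minimal sketch (the sample word and variable names are made up, not from the original post). The 'U*' format decodes the bytes as UTF-8, so raw ISO-8859-1 bytes above 0x7F should trigger the error Claus is seeing, while the same text in UTF-8 unpacks cleanly:

```ruby
# "für" as raw ISO-8859-1 bytes: 0xFC is "ü" in Latin-1,
# but it is not a valid UTF-8 sequence in this position.
latin1_bytes = "f\xFCr".b

unpack_error = nil
begin
  latin1_bytes.unpack('U*')
rescue ArgumentError => e
  # unpack('U*') assumes UTF-8 input and rejects the stray byte.
  unpack_error = e.message
end

# The same word, properly encoded as UTF-8, yields its code points.
utf8_text  = "f\u00FCr"
codepoints = utf8_text.unpack('U*')

puts "Latin-1 input: #{unpack_error.inspect}"
puts "UTF-8 input:   #{codepoints.inspect}"
```

This would also explain the "sometimes it works" part: files that happen to contain only ASCII are byte-for-byte valid UTF-8, so they pass through without complaint.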

How can I read a file with Ruby and specify that it is read with
ISO-8859-1 encoding (similar to Java's BufferedReader, where I can
specify the encoding)?

Investigate Iconv in the standard library. It does what you need.
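
A rough sketch of what that conversion can look like. With the Ruby of this thread (1.8) the call would be along the lines of Iconv.conv('UTF-8', 'ISO-8859-1', str). Iconv was later removed from the standard library (Ruby 2.0), so the example below assumes Ruby 1.9 or newer and uses the built-in encoding support instead. The helper names and the sample text are hypothetical, chosen only for illustration:

```ruby
require 'tempfile'

# Hypothetical helper (not from the thread): read a file stored as
# ISO-8859-1 and return its contents transcoded to UTF-8.
def read_latin1_as_utf8(path)
  # "external:internal" tells Ruby how the bytes on disk are encoded
  # and which encoding the returned String should have.
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

# Same conversion for a String that already holds raw ISO-8859-1 bytes.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Usage: write "Grüße" as Latin-1 bytes, then read it back as UTF-8.
Tempfile.create('latin1-demo') do |f|
  f.binmode
  f.write("Gr\xFC\xDFe".b)
  f.flush
  puts read_latin1_as_utf8(f.path)
end
```

Writing the result back out with File.write(path, text) then produces a UTF-8 file, provided the string's encoding is UTF-8 as it is here.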

Enrique Comba Riepenhausen · May 14, 2007

Hallo Claus,

You could use jcode...

$KCODE = 'UTF8'
require 'jcode'

Cheers,

Enrique Comba Riepenhausen
