Reading Files: how do I specify the encoding?

Claus Hausberger

Hello

I have a lot of XML and Java files which have German umlauts and other
non-ASCII characters in them.

I want to read the files and convert them to UTF-8 using a Ruby script.

I convert the strings with this code:

def to_utf8(str)
  str.unpack('U*').map do |c|
    if c < 0x80
      c.chr
    else
      '( u%04X )' % c
    end
  end.join
end

(taken from "The Ruby Way" by Hal Fulton).

Sometimes it works, sometimes I get this error:
"malformed UTF-8 character"

I thought this might happen because the file is encoded in ISO-8859-1
(it was written with Eclipse set to ISO-8859-1 for text encoding).

How can I read a file with Ruby and specify that it is read with
ISO-8859-1 encoding (similar to Java's BufferedReader, where I can
specify the encoding)?

Any help is welcome. Best wishes

Claus
 
Alex Young

Claus said:
> Hello
>
> I have a lot of XML and Java files which have German umlauts and other
> non-ASCII characters in them.
>
> I want to read the files and convert them to UTF-8 using a Ruby script.
>
> I convert the strings with this code:
>
> def to_utf8(str)
> str.unpack('U*').map do |c|

I'd be surprised if this was right - you're telling it that you're
expecting the string to be UTF-8 already with that unpack format.
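
To illustrate the point, here is a small sketch, assuming the input really is ISO-8859-1 and a Ruby version that has String#b (2.0 or later). The sample string is made up. It shows unpack('U*') rejecting a Latin-1 byte, and a byte-wise alternative that works only because ISO-8859-1 byte values happen to equal Unicode code points:

```ruby
# A Latin-1 (ISO-8859-1) string: "für", with ü stored as the single byte 0xFC.
latin1 = "f\xFCr".b

# unpack('U*') assumes UTF-8 input, so the lone 0xFC byte is rejected
# with an ArgumentError ("malformed UTF-8 character ...").
unpack_error = nil
begin
  latin1.unpack('U*')
rescue ArgumentError => e
  unpack_error = e.message
end

# ISO-8859-1 byte values equal Unicode code points, so reading plain
# bytes ('C*') and writing them back as UTF-8 ('U*') converts the string.
utf8 = latin1.unpack('C*').pack('U*')
```

This trick is specific to ISO-8859-1; other single-byte encodings such as Windows-1252 need a real transcoder.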

> How can I read a file with Ruby and specify that it is read with
> ISO-8859-1 encoding (similar to Java's BufferedReader, where I can
> specify the encoding)?

Investigate Iconv in the standard library. It does what you need.
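
For readers on Ruby 1.9 or later, where encoding support is built into String and IO (Iconv was dropped from the standard library in Ruby 2.0), the same conversion might look roughly like the sketch below. The temporary file and its contents are hypothetical and exist only to make the example self-contained:

```ruby
require 'tempfile'

# Hypothetical sample file: "Grüße" written as raw ISO-8859-1 bytes.
sample = Tempfile.new(['umlaut', '.txt'])
sample.binmode
sample.write("Gr\xFC\xDFe\n".b)
sample.close

# "external:internal" says the file is ISO-8859-1 on disk and asks Ruby
# to transcode the contents to UTF-8 while reading.
text = File.read(sample.path, encoding: 'ISO-8859-1:UTF-8')

# Equivalent two-step form: tag the raw bytes as ISO-8859-1, then transcode.
raw = File.binread(sample.path)
also_text = raw.force_encoding('ISO-8859-1').encode('UTF-8')

# On Ruby 1.8 the Iconv equivalent would be:
#   Iconv.conv('UTF-8', 'ISO-8859-1', raw)

sample.unlink
```

The one-step form is the closest match to passing a charset to a Java reader; the two-step form is handy when the bytes are already in memory.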
 
Enrique Comba Riepenhausen

Hello Claus,

You could use jcode...

$KCODE = 'UTF8'
require 'jcode'

Cheers,

Enrique Comba Riepenhausen
 
