File.new and encoding

  • Thread starter Achim Domma (SyynX Solutions GmbH)

Achim Domma (SyynX Solutions GmbH)

Hi,

I'm still quite new to Ruby, but have written a simple code generator.
The generator opens some files and combines them into a new one. The
resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
a UTF-8 marker to the beginning of the file. Is that possible?

How can I tell Ruby which encoding to use when I write to text files?

Any pointers to documentation are welcome, but I didn't find anything
useful using Google.

regards,
Achim
 
Robert Klemme

Achim said:
Hi,

I'm still quite new to Ruby, but have written a simple code generator.
The generator opens some files and combines them into a new one. The
resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
a UTF-8 marker to the beginning of the file. Is that possible?

What's a UTF-8 marker? I only know the two-byte UTF-16 marker, but AFAIK
there is no marker for UTF-8. Did I miss something?

How can I tell Ruby which encoding to use when I write to text files?

Any pointers to documentation are welcome, but I didn't find
anything useful using Google.

Encoding is not an easy issue with Ruby - I guess by default it uses the
default encoding of your environment. But you can specify certain
(Japanese) encodings with the command line option -K. HTH
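
For readers on Ruby 1.9 or later (newer than the Ruby discussed in this thread), the encoding of an output file can be named in the mode string passed to File.open. The following is only a minimal sketch under that assumption; the helper name write_latin1 and its parameters are hypothetical and not from the original posts.

```ruby
# Hypothetical helper: write text to a file encoded as ISO-8859-1.
# Assumes Ruby 1.9+, where the part of the mode string after the colon
# names the external encoding. Strings are transcoded to it on write.
def write_latin1(path, text)
  File.open(path, "w:ISO-8859-1") do |file|
    file.write(text)
  end
end
```

Calling write_latin1("target", "caf\u00E9\n") would store the accented character as a single byte instead of the two bytes UTF-8 uses. Characters that have no ISO-8859-1 equivalent raise an error rather than being silently dropped.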

Kind regards

robert
 
nobu

Hi,

At Wed, 30 Nov 2005 00:17:29 +0900,
Robert Klemme wrote in [ruby-talk:167988]:
What's a UTF-8 marker? I only know the two-byte UTF-16 marker, but AFAIK
there is no marker for UTF-8. Did I miss something?

It would be a UTF-8 encoded BOM, but Ruby itself never writes it
automatically.

Can't you show the code?
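
The UTF-8 encoded BOM is the character U+FEFF, which comes out as the three bytes EF BB BF. A quick way to check whether some data starts with it might look like the sketch below. The constant and helper names are made up for illustration, and String#b assumes Ruby 2.0 or later.

```ruby
# The UTF-8 encoding of U+FEFF (byte order mark) is the bytes EF BB BF.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: true if the given raw bytes start with a UTF-8 BOM.
# The argument is compared byte-wise, regardless of its declared encoding.
def starts_with_utf8_bom?(bytes)
  bytes.b.start_with?(UTF8_BOM)
end
```

To test a file, it should be enough to read its first three bytes in binary mode, for example with File.open("source", "rb") { |f| f.read(3) }, and pass the result to the helper (guarding against nil for an empty file).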
 
Achim Domma (SyynX Solutions GmbH)

It would be a UTF-8 encoded BOM, but Ruby itself never writes it
automatically. [...]
Can't you show the code?

Trying to reproduce the problem in a smaller example, I figured out
that I'm reading the BOM from one of my source files. Sorry for the
confusion. I'm doing something like:

File.open("target","w") do |target|
  File.open("source","r") do |source|
    source.each_line do |line|
      # ... some processing ...
      target.write(line)
    end
  end
end


source seems to contain the BOM and it is written to target. Any hint on
how to strip the BOM?

regards,
Achim
 
Alex Fenton

I'm doing something like:
File.open("target","w") do |target|
  File.open("source","r") do |source|
    source.each_line do |line|
      # ... some processing ...
      target.write(line)
    end
  end
end

Have you looked at 'iconv' in the standard library?

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html

Assuming all your input files were ISO-8859-1, and you wanted your output file in UTF-8, your example might look something like (untested):

require 'iconv'

File.open("target","w") do |target|
  Iconv.open('UTF-8', 'ISO-8859-1') do |converter|
    File.open("source","r") do |source|
      source.each_line do |line|
        # ... some processing ...
        target.write(converter.iconv(line))
      end
    end
    # Passing nil flushes any pending conversion state.
    target << converter.iconv(nil)
  end
end

Iconv should deal with BOMs, stripping them out or adding them in where necessary. I'm not sure if it will complain if it finds a BOM mid-stream (as you open your second and subsequent input file) - if so you could just instantiate a new Iconv to deal with each input.
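
If Iconv is not available (it left the standard library in Ruby 2.0), or if you only want to drop the BOM without converting anything, one option is to work on raw bytes and remove the three BOM bytes from the first line. This is only a sketch: copy_without_bom and BOM_BYTES are hypothetical names, and String#delete_prefix assumes Ruby 2.5 or later.

```ruby
# UTF-8 encoded byte order mark (U+FEFF) as raw bytes.
BOM_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: copy source_path to target_path line by line,
# dropping a leading UTF-8 BOM if the source starts with one.
# Both files are opened in binary mode so no bytes are reinterpreted.
def copy_without_bom(source_path, target_path)
  File.open(target_path, "wb") do |target|
    File.open(source_path, "rb") do |source|
      source.each_line.with_index do |line, index|
        # Only the very first line can carry the BOM.
        line = line.delete_prefix(BOM_BYTES) if index.zero?
        # ... some processing ...
        target.write(line)
      end
    end
  end
end
```

Binary mode also switches off newline translation on Windows, so this variant copies line endings exactly as they appear in the source.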

HTH
alex
 
