Unicode illegal characters problem

AEtzold · Nov 3, 2007

Dear all,

when using Iconv, I am repeatedly running into
problems.
I tried to run this bit of code:

#!/usr/bin/env ruby
$KCODE = 'u'
require 'iconv'

s = 'caffè'

ic_ignore = Iconv.new('US-ASCII//IGNORE', 'UTF-8')
puts ic_ignore.iconv(s) # => caff

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

(from here:
http://www.ruby-forum.com/topic/70827),
but instead of the promised result in the comments above,
I am getting:

corr_ebook.rb:29:in `iconv': "\351" (Iconv::InvalidCharacter)
from corr_ebook.rb:29

Why ?
I am using ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux] (OpenSuse 10.2)

Thank you very much!

Best regards,

Axel

Jonathan Hudson · Nov 3, 2007

Judging from the error, your data is not UTF-8; it is most likely ISO-8859-1 (or -15).

man iso_8859-1 lists octal 351 as the character you would expect:

351 233 E9 é LATIN SMALL LETTER E WITH ACUTE
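
For anyone who wants to check this diagnosis on a newer Ruby, here is a minimal sketch. It assumes Ruby 1.9 or later, where strings carry an encoding; the thread's 1.8.6 setup uses Iconv instead, and the sample bytes are made up for illustration.

```ruby
# Assumption: Ruby 1.9+ (String#force_encoding / String#encode).
# "caf" followed by the single byte 0xE9, which is octal 351.
raw = "caf\xE9".b

# Interpreted as UTF-8, a lone 0xE9 is an incomplete multi-byte sequence.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?   # false

# Interpreted as ISO-8859-1, the same byte is e with acute,
# and it can then be converted to real UTF-8.
fixed = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts fixed.valid_encoding?     # true
puts fixed.bytes.map { |b| format('%02x', b) }.join(' ')   # 63 61 66 c3 a9
```

If the first check prints false for your file contents, the input is not UTF-8 and has to be converted before any UTF-8 to ASCII step.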

-jh

AEtzold · Nov 3, 2007

Dear Jonathan,

thanks for the hint. You are right. I corrected the encoding
of the file I read the text from:

$KCODE = 'u'
require 'iconv'
s=IO.readlines("/home/axel/text.txt").to_s
p s # => 'caffè'

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

However, now I still get
"caff?" instead of "caff`e" as promised.

I have several novel-length texts to convert with many
different accents.

Thanks for helping me again!

Best regards

Axel

Jonathan Hudson · Nov 3, 2007

I believe that's a "feature" of ruby iconv.

$ echo café | iconv -f UTF-8 -t ASCII//TRANSLIT
cafe

while

s="café"
ic_translit = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)
=> caf?

-jonathan

AEtzold · Nov 3, 2007

Dear Jonathan,

thanks for your clarifications!

Best regards,

Axel

Jonathan Hudson · Nov 3, 2007

Further, it's a *feature* of iconv on **Linux**. On my FreeBSD box I
get the expected results, both from iconv in a shell and from ruby => caf'e.

As, on Linux, the iconv application produces better results than ruby's
iconv, I tend to pipe data through iconv; at least I get a semblance
of usability that way.
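
Piping through the command-line tool can be done from Ruby itself. The following is only a sketch: `pipe_through` and `ascii_via_iconv` are hypothetical helper names, and it assumes an `iconv` executable on the PATH that accepts the same flags as the shell example earlier in this thread.

```ruby
require 'open3'

# Hypothetical helper: feed `input` to an external command (given as an
# array of program name and arguments) and return what it writes to stdout.
def pipe_through(argv, input)
  output, status = Open3.capture2(*argv, stdin_data: input)
  raise "#{argv.first} exited with status #{status.exitstatus}" unless status.success?
  output
end

# Hypothetical wrapper around the iconv command-line tool.
# The result depends on the platform's iconv and on the locale,
# so no expected output is given here.
def ascii_via_iconv(text)
  pipe_through(%w[iconv -f UTF-8 -t ASCII//TRANSLIT], text)
end
```

Since the outcome depends on the locale, as discussed in this thread, it is worth checking a few known words before running whole novels through it.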

-jonathan

angus · Nov 3, 2007

It's not iconv, it's your locale data (which iconv uses). In German, "ü" is
probably transliterated to ASCII as "ue". In Spanish, as "u". There isn't a
single way to do it, and the rules are encoded in the system locale files.

Now, why ruby's iconv gives a different result than the program iconv... I
don't know. Maybe ruby hides some LC_* environment variables from the
library (a wild and probably incorrect guess)...

Summing up: don't use iconv to transliterate to ASCII; build your own table
instead. (It's easy: the descriptions of all Latin letters with diacritics
follow the same pattern.)
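
A hand-built table along these lines might look like the sketch below. It assumes Ruby 1.9 or later with UTF-8 strings (the thread itself uses 1.8.6); the names `ASCII_FALLBACK` and `to_ascii` and the choice of entries are made up for illustration, with "ue" for u-umlaut following the German convention mentioned above.

```ruby
# Hypothetical table; extend it with whatever accents the texts contain.
ASCII_FALLBACK = {
  "\u00e0" => 'a',   # a with grave
  "\u00e1" => 'a',   # a with acute
  "\u00e8" => 'e',   # e with grave
  "\u00e9" => 'e',   # e with acute
  "\u00f1" => 'n',   # n with tilde
  "\u00fc" => 'ue',  # u with diaeresis, German style
  "\u00df" => 'ss'   # sharp s
}.freeze

# Replace every non-ASCII character using the table;
# anything not listed becomes "?" so it is easy to spot.
def to_ascii(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| ASCII_FALLBACK.fetch(ch, '?') }
end

puts to_ascii("caf\u00e9")   # cafe
puts to_ascii("f\u00fcr")    # fuer
```

Because the table is explicit, the result no longer depends on the locale files of whatever machine runs the script.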

Good luck.


bbxx789_05ss · Nov 3, 2007

Axel said:
Dear Jonathan,

thanks for your clarifications!

How does that clarify things for you? I read the other thread, and that
doesn't clarify anything for me. Are you simply interpreting Jonathan
Hudson's statement to mean the other thread is wrong?

Also, I don't think it is very helpful to include every possible Unicode
statement you can think of in an attempt to solve Unicode problems. For
instance, this line:

$KCODE = 'u'

Why are you including that line in your program? According to The Ruby
Way (2nd ed.), p. 141,

"...$KCODE...determines the behavior of many core methods that
manipulate strings."

However, in the code you posted, as far as I can tell, you aren't
calling any methods where the $KCODE changes the way they work. Do you
just include that line anytime you are dealing with unicode, or did you
include it for some specific reason?

Thanks.

bbxx789_05ss · Nov 3, 2007

Another data point:

require 'iconv'

s = "caf\_x_c3\_x_a9"
#The last char is the utf-8 encoding in hex format for 'e' with acute
#I added the underscores so that the encoding won't be rendered
#into the actual character

puts s

#I see cafe where the 'e' is an 'e' with acute, which means my
#display device understands utf-8.

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)

#I see: caf'e

bbxx789_05ss · Nov 3, 2007

Try running this code:

require 'iconv'

s = "caf\_x_c3\_x_a9" #remove underscores
p s
#I see: caf\_303\_251 (without the underscores)
#\_303\_251 (without the underscores) is the utf-8
#encoding in octal format. I really hate that ruby
#displays octal format instead of hex format!

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
new_s = ic_translit.iconv(s)

p new_s #I see: caf'e
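
For readers who share the preference for hex over the octal escapes that p prints, here is a small sketch using String#unpack. The helper name `hex_bytes` is made up for this example.

```ruby
# Hypothetical helper: show every byte of a string as two hex digits.
def hex_bytes(str)
  str.unpack('C*').map { |b| format('%02x', b) }.join(' ')
end

puts hex_bytes("caf\xc3\xa9")   # 63 61 66 c3 a9
```

The output is plain ASCII, so it looks the same on any display regardless of its encoding settings.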

AEtzold · Nov 3, 2007

Dear 7stud,

thanks for the effort that you are putting into
solving this problem.
When I thanked Jonathan for his clarifications,
I meant that I now believe the solution I hoped
to get from the thread where I found that code
isn't going to work for me as easily as I thought.
I do indeed get different behaviour from the system
iconv and Ruby's iconv, as Jonathan said.
With the code you sent me, I
get:

require 'iconv'

s = "caf\xc3\xa9" #(having removed underscores)
p s # => "caf\303\251"
ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
new_s = ic_translit.iconv(s)
p new_s # => "caf?"

What system are you on?
Mine is Linux OpenSuse 10.2, 64bit, Ruby 1.8.6.
If I had your behaviour on my system, this transliteration
would provide a nice conversion of the accents to LaTeX,
but now I think I'll write maybe two dozen gsub lines ...
unless there already is some script that does
a Unicode name to LaTeX accent conversion, something like

small latin letter <lettername> with acute => \'{<lettername>} ?
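
One way to avoid two dozen gsub lines is to decompose each accented letter into base letter plus combining mark and map the mark to a LaTeX accent command. The sketch below assumes Ruby 2.2 or later for String#unicode_normalize (so it is not available on 1.8.6); the names `LATEX_ACCENTS` and `to_latex` are made up, and only single accents on ASCII letters are handled.

```ruby
# Hypothetical mapping from Unicode combining marks to LaTeX accent commands.
LATEX_ACCENTS = {
  "\u0300" => '`',   # combining grave
  "\u0301" => "'",   # combining acute
  "\u0302" => '^',   # combining circumflex
  "\u0303" => '~',   # combining tilde
  "\u0308" => '"'    # combining diaeresis
}.freeze

def to_latex(text)
  # NFD splits e.g. e-acute into "e" followed by U+0301.
  decomposed = text.unicode_normalize(:nfd)
  converted = decomposed.gsub(/([A-Za-z])([\u0300\u0301\u0302\u0303\u0308])/) do
    "\\#{LATEX_ACCENTS[$2]}{#{$1}}"
  end
  # Recompose anything the table did not cover.
  converted.unicode_normalize(:nfc)
end

puts to_latex("caf\u00e9")    # caf\'{e}
puts to_latex("caff\u00e8")   # caff\`{e}
```

Characters outside the table, such as c with cedilla, pass through unchanged and would still need their own rules.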

Best regards,

Axel


bbxx789_05ss · Nov 4, 2007

Axel said:
With respect to the code you sent me, I
get:

require 'iconv'

s = "caf\xc3\xa9" #(having removed underscores)

First of all, even posting messages about Unicode is hard to do because
you have no idea what the other person is seeing. For instance, when
you say 'I see this output:

café'

you have no idea how my display device is displaying those characters,
and I have no idea how your display device is displaying those
characters. Does your display device not understand the encoding, so
there is a question mark at the end (caf?), while my display device does
understand the encoding, so I see an 'e' with acute? Or do you see an
'e' with acute, but I see a question mark? You just can't be sure what
the other person is seeing. I used underscores intermingled with my
character encodings to prevent *any* display device from rendering them.
That way anyone reading the code will know exactly what's there.

As a result, to be clear what's going on, you don't want to be posting:

s = "caf\xc3\xa9" #(having removed underscores)

You want to leave the underscores in when posting about character
encodings. Of course, when you run the code, you need to remove the
underscores.

What system are you on ?
Mine is Linux OpenSuse 10.2, 64bit, Ruby 1.8.6.
If I had your behaviour on my system, this transliteration
would provide a nice conversion of the accents to Latex,

mac os x 10.4.7, pre-installed ruby 1.8.2

AEtzold · Nov 4, 2007

Dear 7stud,

As a result, to be clear what's going on, you don't want to be posting:

s = "caf\xc3\xa9" #(having removed underscores)

Well, thanks for pointing that out, but that's
not a problem here and cannot be, as I just posted a code snippet to
show what I was doing to get that result. As it originally came from
you, you can't possibly misunderstand it, can you?

You want to leave the underscores in when posting about character
encodings. Of course, when you run the code, you need to remove the
underscores.

I am intelligent enough to understand this -- the comment was just
to say that I indeed did remove those underscores.

mac os x 10.4.7, pre-installed ruby 1.8.2

So iconv (both Ruby's and the Linux/Unix tool) seems to behave
differently on different operating systems, independently
of actual or possible rendering issues, which is what Jonathan,
Carlos and I found in our previous discussion.

Nice to know that, nevertheless. :-|

I'd like to say thanks to all of you for your posts.

Best regards,

Axel

bbxx789_05ss · Nov 4, 2007

Axel said:
Well thanks for pointing that out, but that's
not a problem here and cannot be, as I just posted some code snippet to
tell what I was doing to get that result -- as it originally came from
you, you can't possibly misunderstand it, can you ?

1) Who says I'm viewing your current post with the same display device
that I used to write my previous post? People these days own PCs,
laptops, cell phones, etc., all of which can be used to browse the
internet, and all of which may understand different encodings.

2) Who says the encoding that was set on the display device that I used
to send my earlier post hasn't been set to another encoding in the
meantime? All it takes is a simple click on View>Text Encoding>some
other encoding.

AEtzold · Nov 5, 2007

Dear 7stud,

1) Who says I'm viewing your current post with the same display device
that I used to write my previous post? People these days own pc's,
laptops, cell phones, etc. -- all of which can be used to browse the
internet, and all of which may understand different encodings.

Nobody.

2) Who says the encoding that was set on the display device that I used
it to send my earlier post hasn't been set to another encoding in the
meantime? All it takes is a simple click on View>Text Encoding>some
other encoding.

Nobody. Yet if you do these things, you run the risk
of not being perceived as particularly helpful, as the quality of any
advice on this list is, if in doubt, to be measured against whether
it leads to working code on the original poster's machine, not whether
one might deliberately be able to create misunderstandings.

Best regards,

Axel

bbxx789_05ss · Nov 5, 2007

Axel said:
Nobody. Yet if you do these things, you run the risk
of not being perceived as particularly helpful, as the quality of any
advice on this list is, if in doubt, to be measured against whether
it leads to working code **on the original poster's machine**,

Unfortunately, you don't get it. I have no idea what the encoding is on
your machine. You have no idea what the encoding is on *anyone's*
machine that responds to your post--and most likely they're all
different. In order to discuss unicode problems without creating
confusion that can often produce conflicting advice, you can't just post
a bunch of characters which may or may not get rendered for other people
the same way you see them.

I recommended that when you ask Unicode questions, you put
underscores in the character escapes in the code you post. That way no machine
can possibly render them into the characters they represent. Then
everyone who reads your post can know exactly what characters *you* are
dealing with. I also recommended that when you post output, you
describe the output you see rather than just posting it; that
way everyone will know what *you* see. If you don't care to do that,
that is your choice. Most people won't even respond to Unicode
questions. If you follow my suggestions, I think it will make it easier
for the few people who do.

Personally, I don't keep track of the current settings for the encodings
on the various machines I use: work pc, home pc, multiple laptops, cell
phones. I certainly don't synchronize them. And if someone changes the
encoding on one of those machines, or I change it and forget to change
it back, I won't realize it. Typically, I'll read a post and if it's
totally confusing, presumably because what I see is different than what
the op is describing, I move on---which I'm sure at this point is
something you wish I would do. So I will.
