How are people making use of Iconv?


Wilson Bilkovich

Since Iconv jumped out of the pond and chewed on my leg the other
week, I've been toying with the idea of a character-set conversion
library implemented totally in Ruby, with identical behavior on every
platform.
However, I'm only using Iconv for simple things, like converting my
music tags from Shift-JIS to UTF-8.

What 'serious' things are people using this for? Are there any unit
tests? Any gems on RubyForge I can download containing projects that
make use of Iconv? What do you hate about Iconv?

Thanks,
--Wilson.
 

Andreas S.

Wilson said:
> Since Iconv jumped out of the pond and chewed on my leg the other
> week, I've been toying with the idea of a character-set conversion
> library implemented totally in Ruby, with identical behavior on every
> platform.
> However, I'm only using Iconv for simple things, like converting my
> music tags from Shift-JIS to UTF-8.

Well, that's all that Iconv is supposed to be used for.

> What 'serious' things are people using this for? Are there any unit
> tests? Any gems on RubyForge I can download containing projects that
> make use of Iconv?

Rails uses Iconv, at least in ActionMailer.

> What do you hate about Iconv?

I dislike that Iconv raises an exception when it finds characters it can
not convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.
 

Paul Duncan

* Andreas S. ([email protected]) wrote:
> [snipped]
> I dislike that Iconv raises an exception when it finds characters it can
> not convert. I would prefer if it could be made to ignore invalid
> characters and just try to make the best of the text.

Seconded, Thirded, and Quadrupled.

Iconv needs a "as close as I could get with transliteration and ignoring
invalid characters" mode.

We're doing something comparable in Raggle by trapping the exception and
stripping out the invalid character. Obviously this doesn't work
properly for multibyte characters, and you won't be able to use a lookup
table for arbitrary source encodings, but it's a start.

begin
  # convert element_text to native charset (note: in this case we're
  # converting from utf-8 to the native charset, but the only thing
  # about the code that's utf-8 specific is the assumption about
  # character width and the unicode lookup table below)
  ret = $iconv.iconv(element_text) << $iconv.iconv(nil)
rescue Iconv::IllegalSequence => e
  # save the portion of the string that was successful, the
  # invalid character, and the remaining (pending) string
  success_str = e.success
  ch, pending_str = e.failed.split(//, 2)
  ch_int = ch.to_i

  # see if we have a map for that character
  if String::UNICODE_LUT.has_key?(ch_int)
    # we have a mapping for this character, so convert it and
    # re-process the string

    # log status
    err_str = _('converting unicode')
    $log.warn(meth) { "#{err_str} ##{ch_int}" }

    # create new string, with the bad character mapped
    element_text = success_str + String::UNICODE_LUT[ch_int] + pending_str
  else
    if $config['iconv_munge_illegal']
      # munge the illegal character with a safe string

      # log status
      err_str = _('munging unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, with the bad character munged
      munge_str = $config['unicode_munge_str']
      element_text = success_str + munge_str + pending_str
    else
      # just drop the character altogether

      # log status
      err_str = _('dropping unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, sans the bad character
      element_text = success_str + pending_str
    end
  end
  retry
end

Not a perfect solution, but it helps a bit.

-- 
Paul Duncan <[email protected]> pabs in #ruby-lang (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562

 

Wilson Bilkovich

What if String just had a couple of new methods on it:
String#transcode(from_encoding, to_encoding)
..and
String#transcode!(from_encoding, to_encoding)
..and the "modifies receiver" version returned true or false,
depending on whether it managed to convert every character?
Then you could do:
  unless some_string.transcode!('Shift-JIS', 'UTF-8')
    puts "Some characters got mangle-fied!"
  end

Is that a mess? I kinda like it, at first glance.
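
A rough sketch of how such a pair of methods might behave, for anyone who wants to experiment. It is built on String#encode (Ruby 1.9 and later) rather than on Iconv; that substitution, the "?" replacement character, and the method bodies are all assumptions of this sketch, not something proposed or agreed in this thread.

```ruby
# Hypothetical String#transcode / String#transcode!, sketched on top of
# String#encode instead of Iconv.
class String
  # Returns a new string converted from from_encoding to to_encoding.
  # Bytes that are invalid in the source, and characters that have no
  # equivalent in the target, are replaced with "?".
  def transcode(from_encoding, to_encoding)
    encode(to_encoding, from_encoding,
           :invalid => :replace, :undef => :replace, :replace => '?')
  end

  # Converts the receiver in place. Returns true if every character
  # converted cleanly, false if anything had to be replaced.
  def transcode!(from_encoding, to_encoding)
    clean = true
    begin
      converted = encode(to_encoding, from_encoding)
    rescue Encoding::InvalidByteSequenceError,
           Encoding::UndefinedConversionError
      clean = false
      converted = transcode(from_encoding, to_encoding)
    end
    replace(converted)
    clean
  end
end

tag = "caf\xE9".b   # the bytes of "cafe" with an acute accent, in ISO-8859-1
unless tag.transcode!('ISO-8859-1', 'UTF-8')
  puts "Some characters got mangle-fied!"
end
```

The strict conversion is tried first so that the return value can tell a clean conversion from a lossy one; the lossy fallback only runs when the strict attempt fails.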
 

Paul Duncan

I know a future version of Ruby (2.0?) will make a distinction between
strings as arrays of bytes and strings as sets of characters with an
encoding (with the former being an obvious superset of the latter), so
I'm not sure how well that method would work with the new way of
handling strings.

That said, I like the idea, although I'd like an optional block to
handle unknown characters. I'd also add a hash as an optional third
argument which allows you to toggle transliteration, munging, and
exception behavior.
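
To make the block-plus-options idea concrete, here is a minimal sketch. The option names (:translit, :munge, :raise) and their precedence are invented for illustration, and the conversion uses String#encode one character at a time rather than Iconv, so treat it as an assumption-laden toy and not a proposed implementation. Going character by character also makes it unsuitable for stateful target encodings.

```ruby
class String
  # Hypothetical signature: transcode(from, to, options = {}) { |char| ... }
  #   :translit => Hash mapping a source character to a replacement string
  #   :munge    => String used for anything that still cannot be converted
  #   :raise    => re-raise the conversion error (default: true)
  # A block, when given, takes precedence over all of the options.
  def transcode(from, to, options = {})
    translit = options.fetch(:translit, {})
    munge    = options[:munge]
    strict   = options.fetch(:raise, true)
    out      = String.new.force_encoding(to)

    # Work on a copy tagged with the source encoding, one character at a time.
    dup.force_encoding(from).each_char do |char|
      begin
        out << char.encode(to)
      rescue Encoding::UndefinedConversionError,
             Encoding::InvalidByteSequenceError
        replacement =
          if block_given?           then yield(char)
          elsif translit.key?(char) then translit[char]
          elsif munge               then munge
          elsif strict              then raise   # re-raises the current error
          else ''                                 # silently drop the character
          end
        out << replacement.to_s.encode(to)
      end
    end
    out
  end
end

lut = { "\u00E9" => 'e' }   # stand-in for a real transliteration table
puts "caf\u00E9".transcode('UTF-8', 'US-ASCII', :translit => lut)
puts "caf\u00E9".transcode('UTF-8', 'US-ASCII') { |char| '[?]' }
```

Keys in the :translit table are compared against characters in the source encoding, so the table has to be written in that encoding.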

 

Paul Duncan

* Christian Neukirchen ([email protected]) said:
> Paul Duncan said:
>> Iconv needs a "as close as I could get with transliteration and ignoring
>> invalid characters" mode.
>
> Can't you just use //IGNORE?

I wasn't aware of "//IGNORE". I'll check it out. Thanks!

 

Paul Duncan

You sir, are a genius. That works great here.
