Is there a better way to convert foreign characters?

sln · Apr 23, 2009

Ilya Zakharevich wrote:

I strongly disagree. Unicode has its weak points, but it is still
incomparably better than any scheme a Joe Xispack would invent
herself.... Witness the disaster with Emacs Internationalization.

Just:

existence of the notion of "Unicode character",

a possibility of specifying a character unambiguously (with some
minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
in CJK), and

having a list of "property" *names* (which is, basically, the
information about how other people look at individual characters)

should be, IMO, an enormous help in the design of what you call
"manipulations". And I did not even touch "tables", i.e., the *values*
of these properties: it is a major work in itself...

Yours,
Ilya

Unicode is a nightmare. Encoding 1-6 bytes (or more) to represent the
whole range of possible multiple code rendering(s) of character(s) of all
the languages in the world is just out of control.

Internal data manipulation is a nightmare, a hog, and slow as hell.
Is it a byte, a word, int or more? 0 .. (2**32-1) or more! Optimizations?
Encode/Decoding, back and forth. Just a nightmare. And what is it, what
is the encoding of that? Dunno, take a guess! "L,that sucks man!";

Unicode, the expression of everything that does nothing (good).

-sln

Helmut Wollmersdorfer · Apr 23, 2009

Ilya Zakharevich wrote:
[Unicode]

a possibility of specifying a character unambiguously (with some
minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
in CJK), and

.... can not decompose 'overlay diacritics' like l-stroke or o-stroke

having a list of "property" *names* (which is, basically, the
information about how other people look at individual characters)

e.g. distinguish 'confusables' like cyr-A versus latin-A

should be, IMO, an enormous help in the design of what you call
"manipulations". And I did not even touch "tables", i.e., the *values*
of these properties: it is a major work in itself...

Of course. Matching Unicode properties may be slow, but it's far better
than maintaining a table myself (for a language or script I do not know).

The tables in the Unicode locale data are also great work: very incomplete
in places (e.g. transliteration), but they save time in other areas.

Helmut Wollmersdorfer

Helmut Wollmersdorfer · Apr 23, 2009

Unicode is a nightmare.

Writing systems of the world are a nightmare. Unicode just documents them.

Encoding 1-6 bytes (or more) to represent the
whole range of possible multiple code rendering(s) of character(s) of all
the languages in the world is just out of control.

Unicode defines a character *set* not a character *encoding*.

Internal data manipulation is a nightmare, a hog, and slow as hell.

I disagree. Maybe slow, if you use property matching in Perl5.

Is it a byte, a word, int or more? 0 .. (2**32-1) or more!

It is a character - nice to handle in Perl 5.8.

Encode/Decoding, back and forth.

Every system needs encode/decode between internal and external
representation of characters, if encodings differ. If your Perl programs
are well designed, you need it just in one place - in the open statement.

Just a nightmare. And what is it, what
is the encoding of that?

That's the problem which Unicode helps to solve. There are hundreds of
non-Unicode encodings in the wild, some very exotic like
7-bit-ASCII-German, some undocumented.

Unicode, the expression of everything that does nothing (good).

It's your responsibility to use it in a good or bad way.

Helmut Wollmersdorfer

Tim McDaniel · Apr 23, 2009

( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;

I don't combine s///, tr///, or chomp with assignments -- personal
idiom and I'm not familiar with the Perl effects. The above assigns
the lowercase translation of $value to $word, and then does a tr/// on
$word, right? Then there should be no need for the capitalized
characters in the tr///, because there shouldn't be any to match.
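A minimal sketch of the idiom being asked about, assuming $value holds a character string (which `use utf8` guarantees for the literals in a UTF-8 source file). The sub name fold_name is made up for illustration. The parenthesised assignment yields $word itself, so tr/// modifies $word and leaves $value alone. Because lc has already lowercased À, É, Ê, Ç and Ô, only the lowercase forms are listed.

```perl
use strict;
use warnings;
use utf8;    # the accented literals below are characters, not bytes

# Hypothetical helper: lowercase first, then map accented letters to ASCII.
sub fold_name {
    my ($value) = @_;

    # The assignment in parentheses returns $word, so tr/// edits $word;
    # $value is untouched.
    ( my $word = lc $value ) =~ tr/àâéèëêçîïôùû/aaeeeeciioou/;
    return $word;
}
```

With this in place, fold_name("Théodore") returns "theodore". If $value were a byte string instead, lc would leave an uppercase accented letter alone and the shortened tr/// list would miss it.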

I agree with the other posters who suggest using standard modules,
like Undiacritical or whatever it was.

Guy · Apr 23, 2009

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the English version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible, but I think I have to convert $value in one
statement and then assign it to $word in another.

Cheers!
Guy

Just to explain a little: I have a few hundred old pictures of this city
from 1900 to 1940, when it was just a town of about 100 houses. I
want to let the local population search through the photos and perhaps
find their grandparents or even great-grandparents. As today, many of
the folks back then were French, with names like "Roméo" or "Théodore".
Despite this, most people here have English keyboards, and I suspect that
many don't even know how to type French characters like "é". So I
expect people will just search for "Theodore" or "Romeo", perhaps
in lowercase too, such as "theodore" or "romeo". The names are just English
and French, nothing else, at least not in this project. Thanks for all,
Guy
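For a search key like this, one alternative to a hand-written tr/// table is to decompose the string and drop the combining marks. The sketch below uses the core Unicode::Normalize module; the sub name search_key is hypothetical, and the approach is a swapped-in technique, not the code from the original question.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);    # core module since Perl 5.8

# Hypothetical helper: build an unaccented, lowercase search key.
sub search_key {
    my ($name) = @_;
    my $key = NFD($name);    # "é" becomes "e" + COMBINING ACUTE ACCENT
    $key =~ s/\p{Mn}//g;     # drop the combining (nonspacing) marks
    return lc $key;
}
```

Apply the same function to both the stored names and the user's query, then compare the keys. Letters whose diacritic is an overlay, such as o-stroke or l-stroke, have no decomposition and pass through unchanged, so they would still need an explicit mapping.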

Tim McDaniel · Apr 24, 2009

Those are not standard modules.

*sigh* Not "standard" as in distributed with Perl.
But Text::Undiacritic is in CPAN and is therefore easy to get.

Peter J. Holzer · Apr 28, 2009

Finding out the effects is trivial, isn't it?

Yes. So you do know about the effects, after all. ;-)

That's true only if a suitable locale is enabled.

Or if the $value is a character string.

If a programmer wants to do that kind of transliteration, there is a
great chance that s/he doesn't care about any kind of i18n or l10n.

The simple fact that he does specifically operate on accented characters
shows that he *does* care.

If $value is a byte string and no locale is in effect, lc on a non-ASCII
string is poorly defined. If the string is in a multi-byte encoding lc
might convert a byte which happens to be part of a character, which is
almost certainly wrong. Also, tr almost certainly doesn't work as
intended.

In a single-byte encoding which is a superset of ASCII (e.g. ISO-8859-X)
the code works, because lc is a noop on all accented characters. But I
still think this is unclean. You should convert to ASCII first and then
case-fold.

(of course I really think you should use character strings if you do
operations on characters, and not muck around with byte strings)
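The difference between the two kinds of string can be sketched as follows, assuming no `use locale` and no `unicode_strings` feature in effect (the latter is an opt-in that later perls added). The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "É" as a single Latin-1 octet: a byte string, as read from a raw file.
my $bytes = "\xC9";

# Without a locale, lc only knows ASCII here and leaves the octet alone.
my $lc_bytes = lc $bytes;

# The same letter after decoding: a character string.
my $chars = decode( 'ISO-8859-1', $bytes );

# Now lc applies the Unicode case tables and yields "é" (U+00E9).
my $lc_chars = lc $chars;
```

Here $lc_bytes is still "\xC9", while $lc_chars is "\x{E9}".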

hp

Tim McDaniel · Apr 28, 2009

In a single-byte encoding which is a superset of ASCII
(e.g. ISO-8859-X) the code works, because lc is a noop on all
accented characters.

!?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
but lc([A with acute accent]) is [A with acute accent]?! What sort of
nonsense is that?

Gunnar Hjalmarsson · Apr 28, 2009

Peter said:
Or if the $value is a character string.

Hmm.. Yes, so it seems. I wasn't aware of that.

(of course I really think you should use character strings if you do
operations on characters, and not muck around with byte strings)

So far, I haven't bothered with encoding/decoding when I have been
working with Latin-1. Are you saying that encoding/decoding is advisable
even if you are not dealing with UTF-8 or some other encoding with wide
characters?

Gunnar Hjalmarsson · Apr 28, 2009

Tim said:
In a single-byte encoding which is a superset of ASCII
(e.g. ISO-8859-X) the code works, because lc is a noop on all
accented characters.

!?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
but lc([A with acute accent]) is [A with acute accent]?!

No.

$ perl -MPOSIX -le '
setlocale LC_CTYPE, "sv_SE.iso88591";
print lc "ÀÉÊÇÔ";
use locale;
print lc "ÀÉÊÇÔ";
'
ÀÉÊÇÔ
àéêçô
$

But there was no playing with locales in the code we were discussing.

What sort of nonsense is that?

Quoting out of context?

Jürgen Exner · Apr 28, 2009

!?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
but lc([A with acute accent]) is [A with acute accent]?! What sort of
nonsense is that?

Aside from the other replies, please keep in mind that some letters
have no single upper-case or lower-case equivalent.
Just one example would be the German sharp s: ß, which never occurs at
the beginning of a word and, if capitalized in an all-uppercase word,
is written as a double S: SS.
There are also examples where two lower-case letters are mapped to the
same upper-case letter. How do you map the upper-case letter back into
lower case without knowing the context, i.e. the word it is used in?
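The sharp s case is easy to reproduce. This is a small sketch, assuming a character string (forced here with utf8::upgrade so that Perl applies the Unicode case rules); the variable names are illustrative.

```perl
use strict;
use warnings;

my $word = "stra\x{DF}e";    # "straße"; \x{DF} is LATIN SMALL LETTER SHARP S
utf8::upgrade($word);        # make sure Unicode case rules apply

my $upper = uc $word;        # one letter becomes two: "STRASSE"
my $back  = lc $upper;       # "strasse": the sharp s is not restored
```

The round trip changes both the spelling and the length of the string, which is why case conversion cannot be treated as a reversible one-to-one mapping.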

jue

Ilya Zakharevich · Apr 28, 2009

If $value is a byte string and no locale is in effect, lc on a non-ASCII
string is poorly defined.

??? In absence of `use locale', lc should convert to lower-case using
Unicode case-conversion tables. What is "poorly defined" in this semantic?

In a single-byte encoding which is a superset of ASCII (e.g. ISO-8859-X)
the code works, because lc is a noop on all accented characters.

What exactly do you mean here?

perl -Mcharnames=latin -wle "print qq(\N{AE})"
Æ
perl -Mcharnames=latin -wle "print lc qq(\N{AE})"
æ

I must be missing something...

Yours,
Ilya

Ilya Zakharevich · Apr 29, 2009

Peter is talking about byte strings. \N produces utf8 strings, even if
it didn't need to.

There is no such thing as "byte strings" or "utf8 strings". Strings
are strings...

However, there are such things as bugs in perl:

perl -Mcharnames=latin -wle "print ord chr ord qq(\N{AE})"
198
perl -Mcharnames=latin -wle "print ord lc chr ord qq(\N{AE})"
198

Just a bug,
Ilya

Peter J. Holzer · Apr 29, 2009

In a single-byte encoding which is a superset of ASCII
(e.g. ISO-8859-X) the code works, because lc is a noop on all
accented characters.

!?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
but lc([A with acute accent]) is [A with acute accent]?!

Gunnar and I were specifically talking about the case where *no* locale
is active.

What sort of nonsense is that?

The sort of nonsense you when when you don't read postings carefully
enough.

hp

Peter J. Holzer · Apr 29, 2009

Peter said:

Tim McDaniel wrote: [after calling lc]
Then there should be no need for the capitalized characters in the
tr///, because there shouldn't be any to match.

That's true only if a suitable locale is enabled.

Or if the $value is a character string.

Hmm.. Yes, so it seems. I wasn't aware of that.

(of course I really think you should use character strings if you do
operations on characters, and not muck around with byte strings)

So far, I haven't bothered with encoding/decoding when I have been
working with Latin-1. Are you saying that encoding/decoding is advisable
even if you are not dealing with UTF-8 or some other encoding with wide
characters?

Yes.

* Perl knows that a character string is a character string. So matching
against character classes, lc, uc, etc. works automatically.
* You don't have to care about the encoding within your program.
Only for I/O you have to decode/encode, and that can usually be done
with an I/O-layer. So all the encoding-specific stuff is centralized
in one place: Where the file is opened.

For me the rule of thumb is

* When you read character data from an external source, decode it
immediately. If there is an automatic way to do that (I/O layer,
option for the DBD, etc.) use that.
* When you write character data to an external source, encode it as
late as possible. Again, use an automatic way if there is one.

Then, within my program, I know that all character data is in character
strings and everything "just works" whether the data came from a latin-1
file or a utf-8 file or database in big-5. And all the byte data (e.g.,
blobs, images, etc.) is in byte strings, and that also just works.
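The rule of thumb above can be sketched with I/O layers. To keep the example self-contained it uses in-memory filehandles as stand-ins for a Latin-1 input file and a UTF-8 output file; the variable names and sample data are assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-ins for external files: Latin-1 octets in, UTF-8 octets out.
my $latin1_input = "Rom\xE9o\nTh\xE9odore\n";
my $utf8_output  = '';

# Decode on the way in and encode on the way out, both via I/O layers.
open my $in,  '<:encoding(ISO-8859-1)', \$latin1_input or die "open: $!";
open my $out, '>:encoding(UTF-8)',      \$utf8_output  or die "open: $!";

while ( my $name = <$in> ) {
    chomp $name;

    # $name is a character string here; no encoding logic is needed.
    print {$out} uc($name), "\n";
}

close $in;
close $out or die "close: $!";
```

The only places that mention an encoding are the two open calls. Switching the input to another encoding means changing one layer name, and uc works on the accented letters without any locale setup.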

hp

Peter J. Holzer · Apr 29, 2009

??? In absence of `use locale', lc should convert to lower-case using
Unicode case-conversion tables.

For a byte-string? No, it doesn't, and I think it shouldn't (I know some
people disagree on the latter).

What is "poorly defined" in this semantic?

You don't know whether an octet is a character. For example, IIRC the
ISO-2022-JP encoding uses octets in the range 0x20-0x7F in multi-byte
encodings. If you apply lc (without locale information) to an
ISO-2022-JP encoded string, it blindly replaces all octets 0x41-0x5A with
0x61-0x7A, thereby replacing Japanese characters with completely
unrelated Japanese characters.
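A rough way to see this, using the core Encode module (the sample character and variable names are chosen for illustration): encode one hiragana letter as ISO-2022-JP, then apply lc to the octets and, separately, to the decoded character string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{3061}";    # HIRAGANA LETTER TI, a single character
my $jis  = encode( 'iso-2022-jp', $text );    # 7-bit octets

# lc sees only octets. Anything in 0x41-0x5A ("A" to "Z") is changed,
# including the "B" in the escape sequences that switch character sets.
my $mangled = lc $jis;

# Decoding first keeps lc away from the encoding's internals.
my $safe = lc decode( 'iso-2022-jp', $jis );
```

After this, $mangled no longer equals $jis even though the text contains no cased letters at all, while $safe is still the original character.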

What exactly do you mean here?

æ

bernon:~/tmp 12:27 117% perl -CO -wle 'print qq(\x{C6})'
Æ
bernon:~/tmp 12:27 118% perl -CO -wle 'print lc qq(\x{C6})'
Æ

hp

Peter J. Holzer · Apr 29, 2009

There is no such thing as "byte strings" or "utf8 strings".

There is. You may wish that this wasn't the case but that's just wishful
thinking. There *are* differences between byte strings and character
strings. These differences are documented. So they exist in both "perl"
and "Perl".

hp

Peter J. Holzer · Apr 29, 2009

The sort of nonsense you when when you don't read postings carefully
enough.

And this sentence is the sort of nonsense you get when you press the
send button before proof-reading :-(. s/when/get/

hp

Ilya Zakharevich · Apr 29, 2009

For a byte-string? No, it doesn't, and I think it shouldn't

I'd like to hear the logic behind this...

You don't know whether an octet is a character.

If lc is applied to a string, it consists of characters.

For example, IIRC the ISO-2022-JP encoding uses octets in the range
0x20-0x7F in multi-byte encodings. If you apply lc (without locale
information)

So do not... You do not apply lc() to gzipped strings, right? 1/3 ;-)

Yours,
Ilya

Gunnar Hjalmarsson · Apr 29, 2009

Peter said:
Yes.

* Perl knows that a character string is a character string. So matching
against character classes, lc, uc, etc. works automatically.
* You don't have to care about the encoding within your program.
Only for I/O you have to decode/encode, and that can usually be done
with an I/O-layer. So all the encoding-specific stuff is centralized
in one place: Where the file is opened.

For me the rule of thumb is

* When you read character data from an external source, decode it
immediately. If there is an automatic way to do that (I/O layer,
option for the DBD, etc.) use that.
* When you write character data to an external source, encode it as
late as possible. Again, use an automatic way if there is one.

Then, within my program, I know that all character data is in character
strings and everything "just works" whether the data came from a latin-1
file or a utf-8 file or database in big-5. And all the byte data (e.g.,
blobs, images, etc.) is in byte strings, and that also just works.

Thanks for those useful comments, Peter. You gave me something to think
about.

I suppose, though, that your rule of thumb is only applicable as long as
you don't need backwards compatibility with pre-5.8 perl versions.
