Perl and German Umlauts

Dennis Winter

Hi NG,

all my attempts to solve this using perldoc and Google have failed. :(

I installed the ActivePerl 5.8.4.810.msi on a W2kSP4-box. I installed
the POSIX-locale-module, Unicode-*, Charset...

If I try to

use POSIX;
use locale;
$| = 1;
setlocale POSIX::LC_COLLATE, "de_DE";
print "AÄä OÖö UÜü ssß";

I get some weird characters (I hope I don't mangle this post by copying
these characters in here... ;)):

A-õ OÍ÷ U_³ ss¯

E:\htdocs\acltree>perl -V:locale
locale='UNKNOWN';

The OS is running as a German environment. How can I tell Perl to use
the German charset?


Thanks in advance
dennis
 
Ben Bacarisse

Dennis Winter said:
Hi NG,

all my attempts to solve this using perldoc and Google have failed. :(

I installed the ActivePerl 5.8.4.810.msi on a W2kSP4-box. I installed
the POSIX-locale-module, Unicode-*, Charset...

If I try to

use POSIX;
use locale;
$| = 1;
setlocale POSIX::LC_COLLATE, "de_DE";
print "AÄä OÖö UÜü ssß";

I get some weird characters (I hope I don't mangle this post by copying
these characters in here... ;)):

A-õ OÃ÷ U_³ ss¯

It is usually unhelpful to post such output because too many
components might have messed with it. I'd post a hex dump[1]. For
example, on my system your program's output is:

00000000 41 c3 84 c3 a4 20 4f c3 96 c3 b6 20 55 c3 9c c3 |A.... O.... U...|
00000010 bc 20 73 73 c3 9f |. ss..|
00000016

From this I can see that, as expected, the string is printed
literally. Your accented characters are UTF-8 encoded in the source,
and that is how they come out. In fact, because I have my terminal
character encoding set to UTF-8, your program prints: AÄä OÖö UÜü ssß
if I view the output directly.
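
If no hex-dump tool is at hand, the bytes can also be listed from Perl itself. The following is only a minimal sketch: the helper name `hex_bytes` is made up for illustration, and the sample bytes are written as explicit escapes so that the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated hex pairs.
# Assumes $bytes holds raw octets (no characters above 0xFF).
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# "A", then UTF-8 encoded A-umlaut and a-umlaut, given as byte escapes.
my $sample = "A\xc3\x84\xc3\xa4";
print hex_bytes($sample), "\n";   # prints: 41 c3 84 c3 a4
```

These are the same first five bytes as in the dump above.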

What is the output on your system as a sequence of hex numbers?

E:\htdocs\acltree>perl -V:locale
locale='UNKNOWN';

The OS is running as a German environment. How can I tell Perl to use
the German charset?

OK, but how do you want these characters to be encoded, both in the
source and in the output? You seem to have chosen UTF-8 for the
source and if you want the same for the output you will need to have a
device that is expecting to see that encoding.

If you want some Unicode encoding (as seems likely) then you should
read perldoc perluniintro and perldoc perlunicode.
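
To make the question "how do you want these characters encoded?" concrete, here is a small sketch using the core Encode module. It builds the characters from code points, so nothing depends on the encoding of the script file, and then shows the same text as bytes in UTF-8 and in cp850. That cp850 is the code page of the German Windows console is an assumption here; running `chcp` at the prompt shows the actual one.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string built from code points:
# a-umlaut, o-umlaut, u-umlaut, sharp s.
my $text = "\x{E4}\x{F6}\x{FC}\x{DF}";

# The same four characters as bytes in two different encodings.
my $as_utf8  = encode('UTF-8', $text);   # two bytes per character here
my $as_cp850 = encode('cp850', $text);   # one byte per character

printf "UTF-8: %s\n", unpack('H*', $as_utf8);    # c3a4c3b6c3bcc39f
printf "cp850: %s\n", unpack('H*', $as_cp850);   # 849481e1
```

A device expecting one of these byte sequences will show garbage when it is sent the other, which is the kind of effect described in the original post.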
 
Dennis Winter

Hi Ben,

Ben said:

It is usually unhelpful to post such output because too many
components might have messed with it. I'd post a hex dump[1]. For
example, on my system your program's output is:

00000000 41 c3 84 c3 a4 20 4f c3 96 c3 b6 20 55 c3 9c c3 |A.... O.... U...|
00000010 bc 20 73 73 c3 9f |. ss..|
00000016

From this I can see that, as expected, the string is printed
literally. Your accented characters are UTF-8 encoded in the source,
and that is how they come out. In fact, because I have my terminal
character encoding set to UTF-8, your program prints: AÄä OÖö UÜü ssß
if I view the output directly.

Okay, thanks for that note...
[...]
The OS is running as a German environment. How can I tell Perl to use
the German charset?

OK, but how do you want these characters to be encoded, both in the
source and in the output? You seem to have chosen UTF-8 for the
source and if you want the same for the output you will need to have a
device that is expecting to see that encoding.

I'm creating text files from a PHP front end, each containing an absolute
path to a file on the filesystem. Normally PHP saves the file as plain
ASCII, but if there are umlauts in the name of the file, PHP saves the
text file as UTF-8. Then, when I try to read the created text files using
Perl, the umlauts are mutilated. Looking at the text file, I can see that
the path itself has been saved correctly. But Perl seems to read the
file with a different charset.

Meanwhile I added

binmode STDOUT, ':encoding(cp850)';
binmode STDIN, ':encoding(cp850)';

to the script. When run, it says:

"\x{0084}" does not map to cp850 at actiontaker.pl line 226, <RJ> line 3.
!E:\Freigaben\FILESERVER\allshares\austausch\anke\Vertriebs und
Marketingans\x{0084}tze Everyone:(OI)(CI)F !


Thank you!
Dennis
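
One possible reading of the message above, offered only as a guess from the text of the warning: byte 0x84 is "ä" in cp850, so the line may have contained cp850 bytes rather than UTF-8. Read without any decoding, that byte becomes the control character U+0084, which cp850 cannot represent on output. The sketch below reproduces this with the Encode module; the sample bytes are assumed, not taken from the actual file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "ans\x84tze";   # bytes as they might sit in the file (assumed)

# Without decoding, byte 0x84 is treated as the character U+0084,
# which has no cp850 mapping; FB_CROAK turns that into an exception.
my $copy = $raw;          # encode() with a CHECK value may modify its argument
my $ok = eval { encode('cp850', $copy, Encode::FB_CROAK); 1 };
print "encode without decoding: ", ($ok ? "ok" : "failed"), "\n";

# Decoded as cp850, the same byte is a-umlaut (U+00E4).
my $chars = decode('cp850', $raw);
printf "decoded code point: U+%04X\n", ord(substr($chars, 3, 1));
```

If this reading is right, the input would need a cp850 layer rather than a UTF-8 one; a hex dump of the file would settle it.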
 
Dennis Winter

Ben said:
That looks wrong. You say the files are UTF-8 encoded so why not give
':utf8' as the layer, at least for input?

I did that at the top of the script. That helped me with nearly every
appearance of Umlauts.

When I try to set binmode STDIN, STDOUT temporarily to utf-8 right
before the text file is read, the code

************* ORIGINAL ************
binmode STDIN, ':encoding(utf-8)';
binmode STDOUT, ':encoding(utf-8)';
open (RJ, $wdir.$readjob) || die "ERROR: could not read $wdir$readjob:
$!\n";
while (<RJ>) {
my $dir = $_;
*********** ORIGINAL END **********

produces

************* ORIGINAL ************
"\x{0084}" does not map to cp850 at actiontaker.pl line 227, <RJ> line 1.
!E:\Freigaben\FILESERVER\allshares\austausch\anke\Vertriebs und
Marketingans\x{0084}tze Everyone:(OI)(CI)F!
*********** ORIGINAL END **********

It appears to me that binmode can only be set once. Is there a way to
temporarily change the binmode while the file is being read? Or
even to determine which charset is used in the file and switch to that
"on the fly and back"?

Thanks
dennis
 
Ben Bacarisse

Dennis Winter said:
I did that at the top of the script. That helped me with nearly every
appearance of Umlauts.

When I try to set binmode STDIN, STDOUT temporarily to utf-8 right
before the text file is read, the code

************* ORIGINAL ************
binmode STDIN, ':encoding(utf-8)';
binmode STDOUT, ':encoding(utf-8)';
open (RJ, $wdir.$readjob) || die "ERROR: could not read $wdir$readjob:
$!\n";
while (<RJ>) {
my $dir = $_;
*********** ORIGINAL END **********

You seem to be reading from RJ. What has the IO encoding for STDIN
got to do with it?

produces

************* ORIGINAL ************
"\x{0084}" does not map to cp850 at actiontaker.pl line 227, <RJ> line 1.
!E:\Freigaben\FILESERVER\allshares\austausch\anke\Vertriebs und
Marketingans\x{0084}tze Everyone:(OI)(CI)F!
*********** ORIGINAL END **********

See above.

It appears to me that binmode can only be set once. Is there a way to
temporarily change the binmode while the file is being read?
Or even to determine which charset is used in the file and switch to
that "on the fly and back"?

That would probably be wrong. You can't tell by looking what encoding
is used. You need to know how the file is encoded and open it using
the right IO layer.

The form:

open(my $fh, "<:utf8", $fname);

is usually preferred. Read "perldoc perlopentut".
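
A slightly fuller sketch of that form, assuming the job file really is UTF-8. It writes a small temporary file as a stand-in for the PHP-generated one (the file content is invented for illustration) and reads it back through a layer attached to that one handle. It uses `:encoding(UTF-8)` instead of `:utf8` because the former also checks that the input is valid UTF-8.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 file to read back (stand-in for the real job file).
my ($out, $fname) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} "Ans\xc3\xa4tze\n";   # "Ansätze" as UTF-8 bytes
close $out or die "close failed: $!";

# The layer belongs to this handle only; STDIN and STDOUT are untouched.
open(my $fh, '<:encoding(UTF-8)', $fname)
    or die "ERROR: could not read $fname: $!\n";
my $line = <$fh>;
close $fh;
chomp $line;

printf "%d characters, 4th is U+%04X\n",
    length($line), ord(substr($line, 3, 1));   # 7 characters, 4th is U+00E4
```

Because each handle carries its own layer, there is no need to switch STDIN or STDOUT back and forth while the file is being read.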
 
