Creating UNICODE filenames with PERL 5.8


Allan Yates

I have been having distinct trouble creating file names in PERL
containing UNICODE characters. I am running ActiveState PERL 5.8 on
Windows 2000.

For a simple test, I picked a UNICODE character that could be
displayed by Windows Explorer. I can select the character (U+0636) from
'charmap' and cut/paste into a filename on Windows Explorer and the
character displays the same as it does in 'charmap'. This proves that
I have the font available.

When I attempt to create the same filename with PERL, I end up with a
filename two characters long: ض

If somebody could point me in the correct direction, I would very much
appreciate it. I have read the UNICODE documents included with PERL as
well as searching the newsgroups and the web, and everything appears to
indicate this should work.

Perl program:

$name = chr(0x0636);

if (!open(FILE,">uni_names/$name")) {
print STDERR "Could not open ($!): $name\n";
}

close (FILE);


Thanks,

Allan.
a y a t e s a t s i g n i a n t d o t c o m
 

Alan J. Flavell

> I have been having distinct trouble creating file names in PERL
> containing UNICODE characters. I am running ActiveState PERL 5.8 on
> Windows 2000.

N.B. I have limited expertise in this specific area, but some of the
locals around here seem to look to me to answer Unicode questions of
any kind, so I'll give it a try, as long as you take the answers with
the necessary grains of salt...

The first important question is: have you set the option for wide
character API in system calls?
> For a simple test, I picked a UNICODE character that could be
> displayed by Windows Explorer. I can select the character (U+0636) from

that'd be Arabic letter DAD, right?

Its utf-8 representation will be two octets: 0xd8, 0xb6.
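
A minimal sketch to check those octets, using the Encode module that
ships with 5.8:

use Encode qw(encode);

my $octets = encode("utf8", chr(0x0636));       # Arabic letter DAD
printf "%02x ", ord($_) for split //, $octets;  # prints: d8 b6
print "\n";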
> 'charmap' and cut/paste into a filename on Windows Explorer and the
> character displays the same as it does in 'charmap'. This proves that
> I have the font available.

(I think that's the least of your worries at the moment...)
> When I attempt to create the same filename with PERL, I end up with a
> filename two characters long: ض

Those look like 0xd8 and 0xb6 to me...

At a quick glance, I suspect we are seeing the pair of octets that
represent the character in utf-8 (Perl's internal representation)
rather than as what Win32 would use, which AIUI is utf-16LE (which in
this case would come out as 0x3606, IINM). However, I'm not sure that
(other than for diagnostic purposes) you should ever need to tangle
with it in that form, since Perl ought to know what to do in a (wide)
system call.

The system call is evidently treating them as two one-byte characters,
hence my question about wide system calls. Look for the reference to
wide system calls in the perlrun page, and the other references to
which it links.
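
As a sketch, that means either starting perl with the switch (this is
specific to 5.8.0 on Win32; a later post in this thread points out that
5.8.1 removed it):

perl -C yourscript.pl

or setting the global flag from within the program:

${^WIDE_SYSTEM_CALLS} = 1;   # 5.8.0/Win32 only; see perlrun and perlvar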
> If somebody could point me in the correct direction, I would very much
> appreciate it. I have read the UNICODE documents included with PERL as

OK, but there are also some Win32-specific documents/web-pages that
come with the ActivePerl distribution. In some situations they might
be just what you need.
> well as searching the newsgroups and the web, and everything appears to
> indicate this should work.

If the above is not the answer, then maybe Win32API::File has
something for you - but I've never been there myself, so don't pay too
much attention to that.
> Perl program:

But did you start it with the -C option, or set the wide system calls
thingy? I think that may prove to be the key.

Good luck, and please report your findings.
 

Ben Morrow

> I have been having distinct trouble creating file names in PERL

Perl or perl, not PERL.
> containing UNICODE

I'm not so sure about UNICODE...
> For a simple test, I picked a UNICODE character that could be
> displayed by Windows Explorer. I can select the character (U+0636) from
> 'charmap' and cut/paste into a filename on Windows Explorer and the
> character displays the same as it does in 'charmap'. This proves that
> I have the font available.

> When I attempt to create the same filename with PERL, I end up with a
> filename two characters long: ض

OK, your problem here is that Win2k is being stupid about Unicode: any
sensible OS that understood UTF8 would be fine :). My guess would be
that Windows stores filenames in utf16 with a BOM, and if it doesn't
find a BOM it assumes ASCII/'Windows ANSI'... so try this:

use Encode;
$name = chr(0x0636);

$name = encode "utf16", $name;
if (!open(FILE,">uni_names/$name")) {
print STDERR "Could not open ($!): $name\n";
}

close (FILE);

If that works, then we could really do with an addition to the 'open'
pragma to do it for you: use open NAMES => "utf16";... hmmm.

If it fails, delete your file in uni_names and create one by
copy/pasting that character out of charmap. Then run

#!/usr/bin/perl

use warnings;
use bytes;

opendir my $U, "uni_names";
my @n = readdir $U;
$, = $\ = "\n";
print map { "$_: " . join ' ', map { ord } split // } @n;

__END__

and tell me what it says.

Ben
 

Allan Yates

The key was the missing "-C". I didn't clue in from the documentation
that this was important. Once I added that command line parameter, the
file was created with the correct name.

My next step was to read the file name from the directory. However, I
thought I read in some documentation somewhere that 'readdir' is not
UNICODE aware. I seemed to prove this by reading the directory
containing the file I just created. It comes back with a two-character
file name whose characters 'ord' to 0xd8 and 0xb6, as you indicated.

Do you know of a method of reading directories to get the UNICODE file
names?


Thanks,

Allan.
 

Allan Yates

But

You are correct that unicode is not an acronym and should not be
capitalised. My deepest apologies for offending you through the use of
my grammer. I was not aware that grammer police were covering this
newsgroup. PERL is an acronym, "Practical Extraction and Report
Language", and thus may be capitalised.


Allan.

P.S. Please don't even think of chastising me for top posting versus
bottom posting. Different people have different preferences.

P.P.S. For the people who have ignored my grammer and helped me in my
quest, I am very appeciative.
 

Ben Morrow

> You are correct that unicode is not an acronym and should not be
> capitalised. My deepest apologies for offending you through the use of
> my grammer. I was not aware that grammer police were covering this
> newsgroup.

'Grammar police' cover every ng worth having, the reason being that it
is very much easier to understand people when their spelling/grammar/
punctuation is correct.
> PERL is an acronym, "Practical Extraction and Report Language", and
> thus may be capitalised.

Nope, it isn't. From perlfaq1:

| But never write "PERL", because perl is not an acronym, apocryphal
| folklore and post-facto expansions notwithstanding.
> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.

No they don't. Only idiots prefer top-posting.

*PLONK*

Ben
 

Malcolm Dew-Jones

Ben Morrow ([email protected]) wrote:
: (e-mail address removed) (Allan Yates) wrote:
: > I have been having distinct trouble creating file names in PERL

: Perl or perl, not PERL.

: > containing UNICODE

: I'm not so sure about UNICODE...

: > For a simple test, I picked a UNICODE character that could be
: > displayed by Windows Explorer. I can select the character (U+0636) from
: > 'charmap' and cut/paste into a filename on Windows Explorer and the
: > character displays the same as it does in 'charmap'. This proves that
: > I have the font available.
: >
: > When I attempt to create the same filename with PERL, I end up with a
: > filename two characters long: ض

: OK, your problem here is that Win2k is being stupid about Unicode: any
: sensible OS that understood UTF8 would be fine :).

Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
the simple expedient of using 16 bit characters. It is hardware that is
stupid, by continuing to use ancient tiny 8 bit elementary units.

Imagine if all that hardware still used 16 or 24 bit memory addresses.
Imagine if all our communication and hardware backbones still actually
transmitted data in single digit bit sizes.

Character size was always a compromise between functionality and memory.
Character size continually increased from the first character manipulating
electronic equipment of the (gee, way way back 1930's or so, believe it or
not) until the 1980's, when it suddenly solidified into a standard
elementary unit that was still a compromise in terms of size, but is now
clearly too small.

Character size remains frozen due to one of Murphy's laws regarding the
success of hardware first built using compromises that were appropriate
twenty years ago.
 

Ben Morrow

> Ben Morrow ([email protected]) wrote:
> : OK, your problem here is that Win2k is being stupid about Unicode: any
> : sensible OS that understood UTF8 would be fine :).
>
> Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
> the simple expedient of using 16 bit characters. It is hardware that is
> stupid, by continuing to use ancient tiny 8 bit elementary units.

OK, I invited that with gratuitous OS-bashing :)... nevertheless:

1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
work around those who started assuming it was before the standards
were properly in place.

2. Given that the world does, in fact, use 8-bit bytes, any 16-bit
encoding has this small problem of endianness... again, solved
(IMHO) less-than-elegantly by the Unicode Consortium.

3. Given that the most widespread character set is likely to be either
ASCII or Chinese ideograms, and ideograms won't fit into less than
16 bits anyway, it seems pretty silly to encode a 7-bit charset
with 16 bits per character.

4. It also seems pretty silly to break everything in the world that
relies on a byte of 0 meaning end-of-string, not to mention '/'
being '/' (or '\', or whatever, as appropriate).

et cetera

Ben
 

Tad McClellan

Allan Yates said:
> PERL is an acronym,


No it isn't, smarty pants.

> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.


No chastisement, just ignoration in perpetuity.

*plonk*
 

Tassilo v. Parseval

Also sprach Allan Yates:
> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.

Right. And unless you write those articles solely for yourself, the
preferences of your readers count and not yours. So stop top-posting or
the regulars will stop reading your posts.

Tassilo
 

Anno Siegel

Allan Yates said:
> But

> You are correct that unicode is not an acronym and should not be
> capitalised. My deepest apologies for offending you through the use of
> my grammer. I was not aware that grammer police were covering this
> newsgroup.

Grammar. And grammar isn't the problem, spelling is.
> PERL is an acronym, "Practical Extraction and Report
> Language", and thus may be capitalised.

Nope. That was retro-fitted.
> Allan.
>
> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.

Complaining about grammar police and playing thought police?
> P.P.S. For the people who have ignored my grammer and helped me in my
> quest, I am very appeciative.

Translation: "Others can **** off". I think you got what you want.

Anno
 

Alan J. Flavell

> Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
> the simple expedient of using 16 bit characters.

...which unfortunately turns out to be somewhat of a mistake, seeing
that Unicode went and broke the 16-bit boundary.
> It is hardware that is
> stupid, by continuing to use ancient tiny 8 bit elementary units.

utf-8 is the closest they managed to get to variable-length character
encoding. It's not perfect, but it gets around quite a lot of the
compatibility problems that exist with other approaches.
> Imagine if all that hardware still used 16 or 24 bit memory addresses.

Imagine if every us-ascii character were required to occupy 64 bits?
And then there's legacy data to think about.
> Character size was always a compromise between functionality and memory.

Agreed.

> Character size continually increased from the first character manipulating
> electronic equipment of the (gee, way way back 1930's or so, believe it or
> not)

Interestingly, those early codes regularly had shift-in and shift-out
codes to extend their repertoire. A practice which faded out for a
while, almost got reborn in a big way in ISO-2022, and then -
iso-10646/Unicode and associated encodings. I wonder what the future
holds in store? ;-)
> Character size remains frozen due to one of Murphy's laws regarding the
> success of hardware first built using compromises that were appropriate
> twenty years ago.

It's easy to poke fun, but it's harder to come up with a viable
compromise IMHO.

all the best
 

Alan J. Flavell

I don't think so. The O.P. could show appreciation by trying to fit in
with the conventions of Usenet, and participate with the sharing.
Spitting in the group's collective face is no way to show one's
appreciation, that's for sure.
> Translation: "Others can **** off".

I took the hint, too.
> I think you got what you want.

I guess he did, just the once. Well, I hope more-perceptive others
can learn from his mistakes, and I mean not only in terms of technical
content but also in terms of newsgroup interaction.

So much for pot luck.
 

Ben Liddicott

Some history required...


Ben Morrow said:
> OK, I invited that with gratuitous OS-bashing :)... nevertheless:
>
> 1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
> work around those who started assuming it was before the standards
> were properly in place.

Unicode 1.0 WAS a 16-bit character set. So there. UTF16 is a representation
of Unicode 3.0 which was selected to be backwards compatible with Unicode
1.0.

The reason why NT doesn't use UTF-8 is that --- wait for it --- it wasn't
invented back then. UTF-8 was specified in 1993, and adopted as an ISO
standard in 1994. Windows NT shipped in 1993, after 5 years in development.
Guess what: Decision on character set had to be made in the eighties.

Yes, they got it wrong. They should have selected UTF-8. They should have
INVENTED UTF-8.

So you can knock them for not having the foresight to know that 65535
characters wouldn't be enough. That's a mistake a lot of people made, and
with hindsight it is unaccountable: it required a conscious decision to
exclude uncommon characters. The best explanation I have heard for why this
is wrong: "An uncommon character is a common character if it is your name,
or the name of the place where you live".

But don't knock them for not using UTF-8. Clearly anyone designing an OS now
would use UTF-8, of course.

Cheers,
Ben Liddicott
 

Ben Morrow

Ben Liddicott said:
> Some history required...
>
> Unicode 1.0 WAS a 16-bit character set. So there. UTF16 is a representation
> of Unicode 3.0 which was selected to be backwards compatible with Unicode
> 1.0.

OK. This doesn't stop it being completely wrong. Given the choice
between breaking compatibility with the few people who implemented
Unicode 1.0, breaking compatibility with everyone else who was still
assuming everything was a superset of ASCII and creating seven[1]
different, incompatible representations of the supposed answer to
character encoding problems it is fairly clear to me at least which is
the right answer.

Not to mention that, because of the endianness problem, ucs-2 was
broken as an encoding from the start.

[1] utf8, utf16 BE, LE and with BOM, utf32 ditto.
> So you can knock them for not having the foresight to know that 65535
> characters wouldn't be enough.

I can also knock them for not having changed in the ten years since
NT3.5 was released. It is not *that* difficult a change to implement,
as Perl 5.8 has demonstrated; even though it has some nasty bits,
ditto.

Ben
 

Alan J. Flavell

> Guess what: Decision on character set had to be made in the eighties.

Yeah: as far as I recall, IBM invented DBCS EBCDIC. Doubtless a fine
standard for its time. But things move on.
 

Ben Liddicott

Probably your best bet is to try to use Unicode::String to convert the file
names to utf-8. It is obviously reading the filenames using the Unicode API
(otherwise you would get REPLACEMENT CHARACTER instead), but not recognising
that it has done so.

Alternatively, with Win32::API you can use the Win32 functions
FindFirstFileW, FindNextFileW and FindClose. This should be pretty much
guaranteed to work.

Alternatively you can see if File::Find works, though I suspect it may
suffer the same problems.

Alternatively again, you can try spawning a cmd shell, and parsing the
output. This is only going to be any good if ${^WIDE_SYSTEM_CALLS} affects
qx() or open("command |"), and I don't know if it does or not.

If you specify /u to cmd.exe, it sets the console output to UTF-16, which
you could convert back by hand, using Unicode::String. I'm not entirely sure
how one could send unicode in through $sDirName, though. Experimentation may
tell you.

use IO::File;

# /u means unicode output, /c means run command and exit
my $sDirCommand = qq(cmd.exe /u /c dir /a "$sDirName");
my $fh = IO::File->new("$sDirCommand |");   # note the pipe: read cmd's output
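
One would still have to decode what comes up the pipe; here is a minimal
sketch using Encode rather than Unicode::String (the BOM and CRLF details
of cmd's UTF-16 output are glossed over):

use Encode qw(decode);

binmode $fh;                            # the pipe carries raw octets
my $listing = do { local $/; <$fh> };   # slurp the whole listing
$listing = decode("UTF-16LE", $listing);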

Cheers,
Ben Liddicott
 

Ben Morrow

[stop top-posting]


Note that the functionality of -C no longer exists under 5.8.1, and
perl581delta claims it didn't work under 5.8.0 either.
> Probably your best bet is to try to use Unicode::String to convert the file
> names to utf-8. It is obviously reading the filenames using the Unicode API
> (otherwise you would get REPLACEMENT CHARACTER instead), but not recognising
> that it has done so.

No. The right answer is to use Encode::decode to convert *from* utf16.
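
A minimal sketch of that, assuming the names come back as raw UTF-16LE
octets (adjust the encoding name to whatever the API actually hands over):

use Encode qw(decode);

my $bytes = "\x36\x06";                 # U+0636 in UTF-16LE
my $name  = decode("UTF-16LE", $bytes);
printf "U+%04X\n", ord($name);          # prints: U+0636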
> Alternatively you can see if File::Find works, though I suspect it may
> suffer the same problems.

Why don't you look? A quick grep through perldoc -m File::Find shows
that the names come straight out of readdir, so yes, it will suffer
exactly the same problems.
> Alternatively again, you can try spawning a cmd shell, and parsing the
> output. This is only going to be any good if ${^WIDE_SYSTEM_CALLS} affects
> qx() or open("command |"), and I don't know if it does or not.

Bleech. And no, -C will have no effect on this; rather, it will be
affected by the PerlIO layers pushed onto the filehandle.
> If you specify /u to cmd.exe, it sets the console output to UTF-16, which
> you could convert back by hand, using Unicode::String. I'm not entirely sure
> how one could send unicode in through $sDirName, though.

Either -C will use a Unicode-aware pipe-opening API, and it will Just
Work, or use Encode::encode to encode it into whatever Windows expects
command lines to be specified in, probably utf16.

Ben
 

Malcolm Dew-Jones

Alan J. Flavell ([email protected]) wrote:
: On Wed, 18 Nov 2003, Malcolm Dew-Jones wrote:

: > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
: > the simple expedient of using 16 bit characters.

: ...which unfortunately turns out to be somewhat of a mistake, seeing
: that Unicode went and broke the 16-bit boundary.

Which was also a mistake. "Character" now includes all the hieroglyphics
of places like China (but why not all the hieroglyphics of, say, ancient
Egypt? why not all the standardized international road symbols?). When
the Arabians invented the modern idea of characters then it became widely
recognized as much more powerful, fundamentally better, and fundamentally
"different" than the old single-picture-means-a-word method of writing.
Now we have jumped backwards 1800 years. Things like Chinese writing
should not be treated using standardized application level encodings, just
as we now standardize many markup languages for encoding other higher level
data. ($0.02)

: > It is hardware that is
: > stupid, by continuing to use ancient tiny 8 bit elementary units.

: utf-8 is the closest they managed to get to variable-length character
: encoding. It's not perfect, but it gets around quite a lot of the
: compatibility problems that exist with other approaches.

: > Imagine if all that hardware still used 16 or 24 bit memory addresses.

: Imagine if every us-ascii character were required to occupy 64 bits?

First, it would never be 64 bits for a character. Even if we hardcoded
current unicode values, it would be no more than 24 bits per character.

That's three (or two at 16 bits) times the space, which for the vast
majority of users would be irrelevant anyway due to the enormous increase
in storage capacities.
in storage capacities.

Also, it is almost a norm to store any static data in compressed format,
and compression tools would utilize the larger character size to pack more
data, so the total storage space required for a lot of data would not
increase.

Things that would truly be affected, such as humungous databases, already
have to use many mechanisms to be able to manipulate the data, and I'm
sure they could find ways to handle the larger volumes, probably by using
the exact reverse of wide characters.

: And then there's legacy data to think about.

stored on legacy systems, and manipulated using legacy software and
hardware.

This is Murphy's law. Because the old systems have been successful, new
systems can't be made better.

: > Character size was always a compromise between functionality and memory.

: Agreed.

: > Character size continually increased from the first character manipulating
: > electronic equipment of the (gee, way way back 1930's or so, believe it or
: > not)

: Interestingly, those early codes regularly had shift-in and shift-out
: codes to extend their repertoire. A practice which faded out for a
: while,

yes, as soon as hardware costs made larger characters possible, they got
rid of the kludginess.

: almost got reborn in a big way in ISO-2022, and then -
: iso-10646/Unicode and associated encodings. I wonder what the future
: holds in store? ;-)

: > Character size remains frozen due to one of murphy's laws regarding the
: > success of hardware first build using compromises that were appropriate
: > twenty years ago.

: It's easy to poke fun, but it's harder to come up with a viable
: compromise IMHO.

I am out of time to say more.
 

Ben Morrow

> Alan J. Flavell ([email protected]) wrote:
> : On Wed, 18 Nov 2003, Malcolm Dew-Jones wrote:
>
> : > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
> : > the simple expedient of using 16 bit characters.
>
> : ...which unfortunately turns out to be somewhat of a mistake, seeing
> : that Unicode went and broke the 16-bit boundary.
>
> Which was also a mistake. "Character" now includes all the hieroglyphics
> of places like China (but why not all the hieroglyphics of, say, ancient
> Egypt?
Proposed.

> why not all the standardized international road symbols?

I see no reason why these should not also be added.
> ). When
> the Arabians invented the modern idea of characters then it became widely
> recognized as much more powerful, fundamentally better, and fundamentally
> "different" than the old single-picture-means-a-word method of writing.
> Now we have jumped backwards 1800 years.

I think this is a little arrogant, to say the least. Chinese ideograms
(which are not the same as hieroglyphs) have served the needs of the
Chinese admirably: two people from opposite ends of the country,
speaking mutually unintelligible languages, can nevertheless
communicate perfectly through the existence of a common form of
writing.

Apart from that, one of the basic reasons for inventing 'other' character
encodings was so that one could write his own name without resorting
to markup. There are an awful lot of people whose names require
Chinese ideograms to spell...

Note that I do not disagree with you that many of the choices about
what is 'in' and what 'out' of Unicode seem more than a little
arbitrary... :)
> Things like Chinese writing
> should not be treated using standardized application level encodings, just
> as we now standardize many markup languages for encoding other higher level
> data. ($0.02)

I'm afraid I don't follow what you mean here.
> : And then there's legacy data to think about.
>
> stored on legacy systems, and manipulated using legacy software and
> hardware.
>
> This is Murphy's law. Because the old systems have been successful, new
> systems can't be made better.

They can, and are being. Intelligence just needs to be applied at
every stage. A case in point: utf8 both keeps legacy compatibility
*and* is more extensible than ucs2.
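
A one-line illustration of the compatibility half of that claim (a sketch;
for pure us-ascii input the utf8 octets are byte-for-byte the input):

use Encode qw(encode);

my $ascii = "plain ASCII text";
print encode("utf8", $ascii) eq $ascii
    ? "utf8 octets identical to the ASCII\n"
    : "octets differ\n";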
> : Interestingly, those early codes regularly had shift-in and shift-out
> : codes to extend their repertoire. A practice which faded out for a
> : while,
>
> yes, as soon as hardware costs made larger characters possible, they got
> rid of the kludginess.

Agreed, shifting is nasty and has serious problems, such as getting
out of sync. UTF16 surrogates, though, are pure eeevilll.....

Ben
 
