Three questions: UTF-8, DBM, hash of lists, ...

Wes Groleau

I've been rooting around in perlutf8, perlencoding, perlunicode,
and other such things. I think I follow most of it, but there
are some contradictions. Or I thought there were.

1. At the moment, my source is pure ASCII, but I want to
treat it as UTF-8 because the text I work with is UTF-8
and my editor is configured accordingly. (And data
can easily become literals in source). I put -CSD on
my bang-line, which one man page said covers everything
(except -CL which I did not want for some reason). But
another man page seemed to say that "use utf8;" covered
something that -CSD did not, so I put that in, too. Is
either one interfering with the other in any way?

2. One of my applications is reading in a large file, finding
certain patterns, and using them as keys to store everything
else in a DBM hash (use DBM_File; dbmopen %hash, etc.)
The input is 99.5% ASCII--only a few French diacritics, one
copyright symbol, and two Polish characters. Yet adding
the utf-8 constructs to the script and regenerating the DBM
made a HUGE difference in the size of the file. Why is
that?

3. Say an input file contains key and value pairs, BUT
there is more than one possible value for a key.

For example, occupations.

Key Value
----------- ---------
firefighter Fred
chef Charlotte
firefighter Felicia

Can I store a list at the key, or do I have to append
to a string and split on output?

If I can store a list, what is the syntax? The following
is not allowed:


push (@the_hash{$the_job}, $the_name);


If the hash is tied with

use DBM_File;
dbmopen %the_hash .......

does that change the answer?


OK, more than three. :)
 
Jim Keenan

Wes said:
3. Say an input file contains key and value pairs, BUT
there is more than one possible value for a key.

For example, occupations.

Key Value
----------- ---------
firefighter Fred
chef Charlotte
firefighter Felicia

Can I store a list at the key, or do I have to append
to a string and split on output?

If I can store a list, what is the syntax? The following
is not allowed:


push (@the_hash{$the_job}, $the_name);

But wouldn't this be appropriate?

push @{$the_hash{$the_job}}, $the_name;
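
For an ordinary in-memory hash, that line can be tried with the occupations data from the question. This is only a minimal sketch; the @pairs array is a hypothetical stand-in for whatever parsing the real script does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed input file.
my @pairs = (
    [ firefighter => 'Fred' ],
    [ chef        => 'Charlotte' ],
    [ firefighter => 'Felicia' ],
);

my %the_hash;
for my $pair (@pairs) {
    my ($the_job, $the_name) = @$pair;
    # Autovivification creates the array reference the first time a job is seen.
    push @{ $the_hash{$the_job} }, $the_name;
}

for my $job (sort keys %the_hash) {
    print "$job: @{ $the_hash{$job} }\n";
}
# chef: Charlotte
# firefighter: Fred Felicia
```

Each hash value is a reference to an anonymous array, so @{ $the_hash{$job} } gives the list back.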


If the hash is tied with

use DBM_File;
dbmopen %the_hash .......

Shouldn't that be ...?

use DB_File;

Jim Keenan
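
On the "does that change the answer?" part: a hash tied to a DBM file holds only flat strings, so an array reference stored there ends up as its stringified form (something like ARRAY(0x...)) rather than the list. The sketch below shows the join-and-split approach from the question. It assumes SDBM_File as a stand-in DBM (chosen only because it ships with Perl), a tab as a separator that never occurs in a name, and a hypothetical helper called add_name.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

my $SEP = "\t";                 # assumption: names never contain a tab
my $dir = tempdir(CLEANUP => 1);

tie my %the_hash, 'SDBM_File', "$dir/jobs", O_RDWR | O_CREAT, 0666
    or die "cannot tie DBM file: $!";

# Hypothetical helper: append a name to the string stored under a job.
sub add_name {
    my ($job, $name) = @_;
    $the_hash{$job} = exists $the_hash{$job}
        ? join($SEP, $the_hash{$job}, $name)
        : $name;
}

add_name(firefighter => 'Fred');
add_name(chef        => 'Charlotte');
add_name(firefighter => 'Felicia');

# Split on the way out to recover the list.
my @firefighters = split /\Q$SEP\E/, $the_hash{firefighter};
print "@firefighters\n";        # Fred Felicia

untie %the_hash;
```

SDBM limits the size of each key/value pair, so a long list of names would need a different DBM or a serializing layer.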
 
Alan J. Flavell

Three questions

There are no special awards for folding several questions into one
posting. All that it achieves is: several unrelated subthreads
hanging-off the original posting. Confusion all round.

The key to effective problem-solving is to break up a complex problem
into manageable parts, and deal with each separately, until one
understands it well enough to use it as a component of the whole. In
that sense, I'd commend to you the strategy of asking detailed
questions one at a time (with enough context for the group to
understand the detailed question). If, on the other hand, you can't
decide how to partition a complex problem, then ask about the problem
itself, at a higher level, without pre-judging the lower-level
implementation detail. IMHO and YMMV, anyway.
I've been rooting around in perlutf8, perlencoding, perlunicode,
and other such things. I think I follow most of it, but there
are some contradictions. Or I thought there were.

1. At the moment, my source is pure ASCII, but I want to
treat it as UTF-8 because the text I work with is UTF-8
and my editor is configured accordingly.

Please distinguish carefully between your program source and your
data.

As a matter of fact, us-ascii -is- a subset of utf-8 - utf-8 was
deliberately designed that way - but you *don't* have to use utf-8
encoding in your program source in order to process unicode data.

In any case, Perl's unicode implementation is supposed to be
transparent, i.e you shouldn't normally need to know that its internal
representation happens to be utf-8. What you /do/ need to know is
what encoding is used in your /external data/, and to tell Perl about
it at the appropriate time (e.g by an encoding layer on an I/O
statement).
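
A small sketch of what an encoding layer on an I/O statement can look like. The temporary file and its contents are made up for the example; the only point is the '<:encoding(UTF-8)' mode on the open.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw UTF-8 bytes for "caf" + e-acute so the example is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "caf\xc3\xa9\n";
close $tmp or die "close failed: $!";

# Tell Perl which encoding the external data uses at the point of the open.
open my $in, '<:encoding(UTF-8)', $path or die "cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "%d characters, last code point U+%04X\n",
    length($line), ord(substr($line, -1));
# 4 characters, last code point U+00E9
```

Without the layer the same line would come back as five bytes.
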
(And data can easily become literals in source).

In many situations, you might be better advised to write unicode
characters into the source by means of their \x{..} representation.
Which is not to deny that there can also be situations where you'd
want to write unicode characters directly - but then you have to be a
lot more careful with how you edit and transfer your source code.
See
http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Effects-of-Character-Semantics
for more details.
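
For illustration, a sketch of the \x{..} notation with two made-up strings. The source file stays pure ASCII, so it does not matter how it is edited or transferred.

```perl
use strict;
use warnings;

my $word  = "caf\x{e9}";      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $zloty = "z\x{142}oty";    # U+0142 LATIN SMALL LETTER L WITH STROKE

# Encode on the way out so the terminal receives UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$word $zloty\n";

printf "%d and %d characters\n", length($word), length($zloty);
# 4 and 5 characters
```
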
I put -CSD on
my bang-line, which one man page said covers everything
(except -CL which I did not want for some reason).

Could we have a cite on that?

-C is a request to use wide system calls. It doesn't influence Perl's
interpretation of your program source or data "as such".
But
another man page seemed to say that "use utf8;" covered
something that -CSD did not, so I put that in, too.

The perlunicode pod, for the version of Perl that you're using, should
be your "bible". Don't go tossing-in arbitrary bits and pieces that
you may have acquired from elsewhere - treat them as possibly
misleading clues, but check with the authoritative documentation to
make sure that they really do what you want.

See what
http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Important-Caveats
says about "use utf8;".
Is either one interfering with the other in any way?

I don't know of any reason why they should.

good luck
 
Wes Groleau

Alan said:
There are no special awards for folding several questions into one

No rewards expected or requested.
hanging-off the original posting. Confusion all round.

Welcome to Usenet.
Please distinguish carefully between your program source and your
data.

I did. When I said "source," I meant "source" and when
I said "text" I meant what you apparently call "data."
As a matter of fact, us-ascii -is- a subset of utf-8 - utf-8 was
deliberately designed that way - but you *don't* have to use utf-8
encoding in your program source in order to process unicode data.

I know that. However, I prefer that everything on my system
be interpreted as UTF-8, as I work with French, Spanish, Polish,
and Japanese. The script is all ASCII _now_ but I could add
literals for searching or whatever at any time.
In any case, Perl's unicode implementation is supposed to be
transparent, i.e you shouldn't normally need to know that its internal
representation happens to be utf-8. What you /do/ need to know is

I don't want to know what it does internally, as long as everything
comes out UTF-8 and is decoded as such going in.
what encoding is used in your /external data/, and to tell Perl about
it at the appropriate time (e.g by an encoding layer on an I/O
statement).

Since I want _everything_ UTF-8, the appropriate time
is (if possible) at the beginning of the script.
In many situations, you might be better advised to write unicode
characters into the source by means of their \x{..} representation.

My terminal renders the glyphs correctly when I 'cat' UTF-8.
Why should I have to look up the codes every time instead?
And although I can compose characters in hex, why should
I do that instead of cut-and-paste from the editor?
Which is not to deny that there can also be situations where you'd
want to write unicode characters directly - but then you have to be a
lot more careful with how you edit and transfer your source code.
See
http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Effects-of-Character-Semantics
for more details.

Yes, I read that. I'm trying to minimize the need for "being careful"
about all those ten zillion details by specifying "everything is UTF-8."
-C is a request to use wide system calls. It doesn't influence Perl's
interpretation of your program source or data "as such".

You're right:

man perlrun
.....

As of 5.8.1, the "-C" can be followed either by a number or a list
of option letters. The letters, their numeric values, and effects
are as follows; listing the letters is equal to summing the numbers.

I 1 STDIN is assumed to be in UTF-8
O 2 STDOUT will be in UTF-8
E 4 STDERR will be in UTF-8
S 7 I + O + E
i 8 UTF-8 is the default PerlIO layer for input streams
o 16 UTF-8 is the default PerlIO layer for output streams
D 24 i + o

Seems to say -CSDA should handle all my IO (I left off the A because
I still have a little bit of resistance to overcome from the shell)
except for the script itself. A detail I missed. Not an issue yet,
but I'd like to fix it before it becomes one.
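
The "listing the letters is equal to summing the numbers" rule from that table can be checked with a few lines. This is only arithmetic on the values quoted above; c_number is a hypothetical helper name, and letters that are not in the quoted table are deliberately rejected.

```perl
use strict;
use warnings;

# Values copied from the perlrun table quoted above.
my %value = (I => 1, O => 2, E => 4, i => 8, o => 16);
$value{S} = $value{I} | $value{O} | $value{E};   # 7
$value{D} = $value{i} | $value{o};               # 24

# Hypothetical helper: turn a string of -C letters into the equivalent number.
sub c_number {
    my ($letters) = @_;
    my $n = 0;
    for my $letter (split //, $letters) {
        die "letter '$letter' is not in the quoted table\n"
            unless exists $value{$letter};
        $n |= $value{$letter};
    }
    return $n;
}

print c_number('SD'), "\n";   # 31
```

So, going by the table, -CSD and -C31 ask for the same thing.
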
The perlunicode pod, for the version of Perl that you're using, should
be your "bible". Don't go tossing-in arbitrary bits and pieces that

I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
derived from the pod.

It says the same as my man page: that the pragma is needed
to "enable UTF-8" in scripts. It doesn't say whether
"enable" means the script itself or the IO or both.
However, 'man perlrun' says the -CSD handles the IO,
and perlunicode says for script encoding, see encoding
which says that UTF-8 already works in scripts.

So, things are a little unclear. I put in both, and
was able to read UTF-8 text, put it in a DBM hash, and
get it back out. That's good enough for now.
 
Alan J. Flavell

Welcome to Usenet.

Indeed. It seems from your response, and the rarity of responses from
other contributors, that you're in the position to offer us all a
valuable tutorial on the topic.
I don't want to know what it does internally, as long as everything
comes out UTF-8 and is decoded as such going in.

Fine, then we're pretty much up to speed already, and I'm sorry that I
misinterpreted your original posting.
Yes, I read that. I'm trying to minimize the need for "being
careful" about all those ten zillion details by specifying
"everything is UTF-8."

Point made. If you're really in control of all that data then you're
in a much happier position than I've ever been ;-)
I 1 STDIN is assumed to be in UTF-8
O 2 STDOUT will be in UTF-8
E 4 STDERR will be in UTF-8
S 7 I + O + E
i 8 UTF-8 is the default PerlIO layer for input streams
o 16 UTF-8 is the default PerlIO layer for output streams
D 24 i + o

Seems to say -CSDA should handle all my IO

It does, doesn't it? Did I miss the specific problem you were having,
and your test case that demonstrated it?
I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
derived from the pod.

No disagreement there. More than one way to...read the documentation.
It says the same as my man page: that the pragma is needed
to "enable UTF-8" in scripts.

Hmmm? At 5.8.4 (and I don't remember it being different in recent
versions before that) it says [this'll need monospace display, and go
sadly wrong with these newfangled usenet-ish interfaces, sorry]:

As a compatibility measure, the use utf8 pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts
^^^^^^^^^^^^^^^^^^^
themselves (in string or regular expression literals, or in
^^^^^^^^^^
identifier names) on ASCII-based machines or to recognize UTF-EBCDIC
on EBCDIC-based machines. These are the only times when an explicit
^^^^^^^^^^
use utf8 is needed.
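
A quick way to see what that paragraph means for a string literal, as a sketch that assumes the script file itself is saved as UTF-8:

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is two bytes on disk.
my $without = "é";                      # no pragma: Perl sees two bytes
my $with    = do { use utf8; "é" };     # pragma in scope: Perl sees one character

printf "without use utf8: length %d\n", length $without;   # 2
printf "with use utf8:    length %d\n", length $with;      # 1
```

The pragma is lexically scoped, which is why it can be confined to the do block here.
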
However, 'man perlrun' says the -CSD handles the IO,

Indeed, and (fwiw) I don't see anything there about encoding of the
script's source code itself.
and perlunicode says for script encoding, see encoding
which says that UTF-8 already works in scripts.

It "works", yes, but (as I understand it, anyway) I think you have to
ask for it. It could just be that if you call for locale-awareness
with -CL, and you have utf-8 in your locale, it will come out in the
wash; but I don't see any harm in asking for it directly, if you're so
certain that you'll never not want it (sorry for the double-negative).
So, things are a little unclear. I put in both,

Looks as if you're (a) right and (b) unlikely to cause any harm.
was able to read UTF-8 text, put it in a DBM hash, and
get it back out. That's good enough for now.

Good luck
 
Wes Groleau

Alan J. Flavell wrote:
[re UTF-8 in perl scripts]
It "works", yes, but (as I understand it, anyway) I think you have to
ask for it. It could just be that if you call for locale-awareness
with -CL, and you have utf-8 in your locale, it will come out in the
wash; but I don't see any harm in asking for it directly, if you're so
certain that you'll never not want it (sorry for the double-negative).

I also left the L off of -C because I don't think I have that completely
coerced to UTF-8
Looks as if you're (a) right and (b) unlikely to cause any harm.

Sigh, now it starts getting weird. Kind of long, summary at the bottom.

The script with -CSD and use utf8 created a database,
and a test script pulled the records out of the database
and printed them. The non-ASCII characters rendered
correctly BUT that doesn't mean anything, since the test
script had the same -CSD and use utf8. (Right?)

So I figured I needed to eyeball inside the DB file
and see if I could find some nonASCII and see how it was encoded.
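
For reference when eyeballing the file, a sketch that prints the UTF-8 byte sequences to look for. The two sample characters (e-acute and Polish l-with-stroke) are only examples, and hex_bytes is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $text   = "\x{e9}\x{142}";          # e-acute, l-with-stroke
my $stored = encode('UTF-8', $text);   # the bytes a UTF-8 file would contain

print hex_bytes($stored), "\n";        # c3 a9 c5 82
```

If the dump shows a lone e9 instead of c3 a9, the value was stored as Latin-1 rather than UTF-8.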

But a series of unfortunate events resulted in my having
to re-create the script, and then it crashed (bus error
or segmentation fault). Figured out which record it
was crashing on, put it in its own file, and ....
well to skip over the long tedious details, I eventually
had a version of the script that would crash and one that
would not crash on the same input file.

'diff' showed only one difference:

wgroleau$ diff ~/bin/GEDCOM_DB ./tempGCDB
1c1
< #!/usr/bin/perl -w -CSD
---
> #!/usr/bin/perl -w -CSD

od -xc revealed that the extra space is indeed a (hex 20)
regular space and not a UTF-8 construct.

More study showed that the space made a difference on the only
two systems I currently have access to:

wgroleau$ uname -a
Darwin Groleau.local 7.7.0 Darwin Kernel Version 7.7.0: Sun Nov 7
16:06:51 PST 2004; root:xnu/xnu-517.9.5.obj~1/RELEASE_PPC Power
Macintosh powerpc
wgroleau$ perl -v

This is perl, v5.8.1-RC3 built for darwin-thread-multi-2level
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2003, Larry Wall

AND

[0:ag/g/groleau> uname -a
NetBSD otaku 1.6.2_STABLE NetBSD 1.6.2_STABLE (sdf) #0: Sun Jul 25
04:17:09 UTC 2004 root@ol:/var/src/src/sys/arch/alpha/compile/sdf alpha

[0:ag/g/groleau> perl -v

This is perl, v5.8.0 built for alpha-netbsd

Copyright 1987-2002, Larry Wall


On Darwin/PPC, the extra space prevents bus error/segmentation fault.
On NetBSD/Alpha, it prevents the following:

[0:ag/g/groleau> rm wgroleau.DB; ./tempGCDB < bad.record.GED
Recompile perl with -DDEBUGGING to use -D switch
Can't emulate -S on #! line at ./tempGCDB line 1.
[255:ag/g/groleau> head -1 ./tempGCDB
#!/usr/pkg/bin/perl -w -CSD


Summary: On two different platforms, in

#!/usr/bin/perl -w -CSD

the extra space is required.

If anyone wants to try it on a different system, I can provide
the script and the input file.
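
One way to take the #! line out of the picture would be to set the layers from inside the script with the open pragma, which asks for roughly what -CSD asks for (standard streams plus default layers for open). This is an untested-on-those-systems sketch with a made-up temporary file, not a claim that it cures the crash.

```perl
use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );   # default layers for open() plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer given here, so the default from the pragma applies.
open my $out, '>', $path or die "cannot write $path: $!";
print {$out} "caf\x{e9}\n";
close $out or die "close failed: $!";

# Read the file back as raw bytes to see what was actually written.
open my $raw, '<:raw', $path or die "cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";   # 63 61 66 c3 a9 0a
```
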

--
Wes Groleau
-----------

"Thinking I'm dumb gives people something to
feel smug about. Why should I disillusion them?"
-- Charles Wallace
(in _A_Wrinkle_In_Time_)
 
