Wes Groleau
I've been rooting around in perlutf8, perlencoding, perlunicode,
and other such things. I think I follow most of it, but there
are some contradictions. Or I thought there were.
1. At the moment, my source is pure ASCII, but I want to
treat it as UTF-8 because the text I work with is UTF-8
and my editor is configured accordingly. (And data
can easily become literals in source.) I put -CSD on
my bang-line, which one man page said covers everything
(except -CL which I did not want for some reason). But
another man page seemed to say that "use utf8;" covered
something that -CSD did not, so I put that in, too. Is
either one interfering with the other in any way?
2. One of my applications is reading in a large file, finding
certain patterns, and using them as keys to store everything
else in a DBM hash (use DBM_File; dbmopen %hash, etc.)
The input is 99.5% ASCII--only a few French diacritics, one
copyright symbol, and two Polish characters. Yet adding
the UTF-8 constructs to the script and regenerating the DBM
made a HUGE difference in the size of the file. Why is
that?
3. Say an input file contains key and value pairs, BUT
there is more than one possible value for a key.
For example, occupations.
    Key          Value
    -----------  ---------
    firefighter  Fred
    chef         Charlotte
    firefighter  Felicia
Can I store a list at the key, or do I have to append
to a string and split on output?
If I can store a list, what is the syntax? The following
is not allowed:
push (@the_hash{$the_job}, $the_name);
If the hash is tied with
use DBM_File;
dbmopen %the_hash .......
does that change the answer?
OK, more than three.