Troubles with Unicode (incorrect sorting and basic understanding problems)


peter pilsl

Sorry, but I don't get this Unicode thingy right. I don't even know if it is
a Perl problem, because I have three applications interacting.

The task is to enter text via a web interface, store it in an SQL database
(Postgres) and print it out to a web page again. The text can be anything
from Russian, German, English, Spanish ...

Everything seems to run OK: the text entered in the web interface is
processed correctly in Perl (the Unicode characters appear as two bytes in
the string) and stored correctly in the database (as two bytes again). Even
the way back works like a charm, and all the text is printed out correctly.

The problem starts when doing the sorting.
If I let the SQL database do the sorting, all the "exotic" characters are
sorted wrongly (German umlaut-O ends up between A and B ...). I'll ask the
Postgres people about this.
If I let Perl's sort do the job, all the "exotics" end up at the end
(umlaut-O comes after Z).

I expect German umlaut-O to occur right between O and P.

Of course I could implement my own sorting algorithm that deals with these
special cases, but this would slow things down, and I don't think I should
have to, because Perl should be able to do it.

I think the problem (with both the SQL sort and the Perl sort) is that
apparently only the first byte of the two-byte sequence is taken into
account when sorting. Is this because I am doing something wrong, is there
something I should be doing, or is this a very common problem?

To illustrate my problem, I put a small sample-script online:

http://www.goldfisch.at/cgi-bin/unicodetest4.pl

If you enter some text, it will be inserted into the database and all
existing entries will be printed out, sorted by Perl.

See the source at http://www.goldfisch.at/test/unicodetest4.txt


I've read about Unicode::Collate as a way to change the collating/sorting
behaviour, but I couldn't work out how to use it to get "default" Latin
sorting.

Any help is appreciated.

thnx,
peter
 

Jürgen Exner

peter pilsl wrote:

Can't help you much, but ...

> The task is to enter text via a web interface, store it in an SQL
> database (Postgres) and print it out to a web page again. The text
> can be anything from Russian, German, English, Spanish ...
>
> Everything seems to run OK: the text entered in the web interface is
> processed correctly in Perl (the Unicode characters appear as two bytes
> in the string) and stored correctly in the database (as two bytes again).

So I guess you are talking about UTF-16 then, right?
"Unicode" by itself is not very precise because it has 3 defining encodings
now, so you had better say which one you are using.

> Even the way back works like a charm, and all the text is printed out
> correctly.
>
> The problem starts when doing the sorting.
> If I let the SQL database do the sorting, all the "exotic" characters
> are sorted wrongly (German umlaut-O ends up between A and B ...). I'll
> ask the Postgres people about this.
> If I let Perl's sort do the job, all the "exotics" end up at the end
> (umlaut-O comes after Z).
> I expect German umlaut-O to occur right between O and P.

Careful!
That would be correct for German, but for, let's say, Swedish all umlauts
are sorted lexicographically after the Z, i.e. Perl happens to do the right
thing for Swedish.
In other words: the sorting order for extended characters depends on the
locale of the text.
Maybe it will be sufficient to just tell Perl which locale to use for this
sort, but I've never tried it myself.

> [...] but I couldn't work out how to use it
> to get "default" Latin sorting.

There is no such thing as a default sorting order for extended characters.

jue
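
To make the locale dependence concrete, here is a minimal sketch using Unicode::Collate::Locale, the locale-aware companion to the Unicode::Collate module mentioned in the question. The word list is made up for illustration, and the sketch assumes an installed version of the module that knows the locale tags 'de' and 'sv'; it is not a statement about what the original script should do.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Hypothetical sample data; "\x{d6}" is O-umlaut (U+00D6).
my @words = ("Zebra", "\x{d6}l", "Ofen", "Apfel");

# German dictionary order: O-umlaut sorts together with plain O.
my $de     = Unicode::Collate::Locale->new(locale => 'de');
my @german = $de->sort(@words);

# Swedish order: O-umlaut is a separate letter that comes after Z.
my $sv      = Unicode::Collate::Locale->new(locale => 'sv');
my @swedish = $sv->sort(@words);

# Encode on output so the umlaut is printed as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "de: @german\n";
print "sv: @swedish\n";
```

The same input list comes back in two different orders, which is the point made above: the collator has to be told whose rules to apply.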
 

Alan J. Flavell

> Sorry, but I don't get this Unicode thingy right. I don't even know if it
> is a Perl problem, because I have three applications interacting.

Then I'd have to recommend separating the parts and
understanding each component separately, to at least the level
necessary to understand how they interface to each other.

> The task is to enter text via a web interface, store it in an SQL database
> (Postgres) and print it out to a web page again. The text can be anything
> from Russian, German, English, Spanish ...

Chinese? Arabic?

> Everything seems to run OK: the text entered in the web interface is
> processed correctly in Perl (the Unicode characters appear as two bytes in
> the string)

I don't know quite how to say this, but since you give the impression
that you aren't clear about what's going wrong, how can you be in a
position to tell us that this particular part is "correct"? Perl's
native Unicode representation uses utf-8, in which characters occupy
one, two, three or more bytes, not exactly two. What character
representation are you using at this point that you get exactly two?

> If I let Perl's sort do the job, all the "exotics" end up at the end
> (umlaut-O comes after Z).

Have you said anything to Perl about what language locale it's meant
to be using? Sorting order is different for different languages.

Are you allowing Perl to work with Unicode characters, or is it
working at the byte level? I looked briefly at your source code but
to be honest I wasn't able to answer that.

> I think the problem (with both the SQL sort and the Perl sort) is that
> apparently only the first byte of the two-byte sequence is taken into
> account when sorting.

I deduce from this that you are trying to work at the byte level,
with your characters occupying pairs of bytes? That is _not_ Perl's
own internal character representation.

I think my first move would be to factor-out your database stuff.
It's only acting as a temporary store for your data, after all, and
as a potential cause of extraneous confusion, as far as I can see.
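
The difference between working at the byte level and at the character level can be seen without any web form or database. The following is a small sketch using the core Encode module; the sample string is an assumption chosen for illustration, not data from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "\xC3\x96" is the UTF-8 byte sequence for O-umlaut (U+00D6).
my $bytes = "\xC3\x96l";
my $chars = decode('UTF-8', $bytes);

# The same text has a different length depending on the level we work at.
printf "as bytes:      length %d\n", length($bytes);
printf "as characters: length %d\n", length($chars);

# Not every character occupies two bytes in UTF-8.
for my $cp (0x41, 0xD6, 0x263A) {
    my $n = length(encode('UTF-8', chr($cp)));
    printf "U+%04X takes %d byte(s) in UTF-8\n", $cp, $n;
}
```

An umlaut that "adds two to length()" is therefore a sign that the string is still a sequence of undecoded UTF-8 bytes.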
 

peter pilsl

Alan J. Flavell wrote:

<skip>

OK - I tried to reduce the problem and wrote a small script that simply
reads the content from a web form (textarea), lowercases and sorts the lines
in it, and prints out the sorted lines.

The web form has UTF-8 as its charset, and its content reaches Perl via
Apache and the CGI module. Then I use the standard lc() and sort()
functions to process the text.
Since I'm using Perl 5.8.0, I didn't specify any UTF pragma.

What I observe is that lc() does not lowercase German umlauts and sort()
puts German umlauts after Z.

My question now is: how can I tell Perl (inside my script) that it should
use "German" conventions for sorting and lowercasing? Is this possible at all?

The script is online at
http://www.goldfisch.at/cgi-bin/unicodetest7.pl

The source is below. If I also print out the length of the string using the
length() function, I notice that each German umlaut increases length() by
two, so I assumed that it is represented as a two-byte character, and I also
wondered why length() counts bytes rather than characters. But as you
already mentioned: I'm still not exactly clear about what I'm talking about
at all :(
Maybe I need to convert the text delivered by the CGI module to "real"
Unicode first, or something like that.


Thanks a lot for any help,
peter


------------------------
#!/usr/local/bin/perl

use CGI;
use strict;

my $query = new CGI;
my $charset = 'UTF-8';

# print header
print $query->header(-type=>'text/html; charset='.$charset),
      $query->start_html(-title=>'Unicodetest');

# read, process, print the form-content
if ($query->param('submit'))
{
    foreach (sort(map {lc $_} split(/\n/,$query->param('unicode')))) {
        print $_,"<br>\n";              # print the lowered string
        # print length($_)-1,"<br>\n";  # print the length
    }
}

# print the form
print '<br><br>enter your unicode-testtext here : ',
      $query->start_multipart_form,
      $query->textarea(-name=>'unicode',-rows=>10,-columns=>100),
      "\n<br>\n",
      $query->submit(-name=>'submit',-value=>'proceed'),"\n",
      $query->endform,"\n";


print $query->end_html;
 

Alan J. Flavell

> OK - I tried to reduce the problem and wrote a small script that simply
> reads the content from a web form (textarea), lowercases and sorts the
> lines in it, and prints out the sorted lines.

Excellent, thanks.

I'm only sorry that I don't have a complete answer for you, but I
think I've found a number of issues that are relevant and may help to
get you there.

First some general issues:

- earlier versions of CGI.pm evidently deliver the data as just a
bunch of bytes, even when they're utf-8. One doesn't notice this
externally, since the bytes get written as-is to standard output and
HTML gets told they are utf-8 and so the expected correct result is
seen on the browser. However, any actual character-level processing
that's attempted in Perl will be unsuccessful, since it will treat
each byte as an individual character.

I got this effect for example with the CGI.pm version that came with a
recent ActivePerl (CGI.pm 2.81, I think it was). Make a practice of
printing $CGI::VERSION somewhere in your output, so that you know
what you're getting. Install a private copy of a recent version and
access it via "use lib ...".

According to the notes, http://stein.cshl.org/WWW/software/CGI/#new ,
it appears that this was changed in version 2.92, and one can (if
using a recent enough Perl, of course) expect to get genuine Perl
Unicode characters as data now in this situation. I haven't had time
to go into the details of how it relates to your specific problem,
sorry.

The other issue is that while reading about Unicode support in
relation to "use locale", there are dark hints that there may be
problems. Since you're evidently aiming to do language-specific
sorting (though quite what it'll do when you ask it to sort Greek and
Cyrillic according to German language rules, I don't know), you
may want to follow this up, at least if you get problems:
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html#Locales

> My question now is: how can I tell Perl (inside my script) that it should
> use "German" conventions for sorting and lowercasing? Is this possible at
> all?

The answer to that appears to be
http://www.perldoc.com/perl5.8.0/pod/perllocale.html#The-use-locale-pragma
(et seq.)

> The source is below. If I also print out the length of the string using
> the length() function, I notice that each German umlaut increases length()
> by two,

This is indicative, I think, that you are using an older version of
CGI.pm. Your particular character occupies 2 bytes in utf-8, and Perl
is counting two bytes instead of one character. Worse, if I try to
lower-case the *character* then it can produce garbage, because Perl
attempts to lower-case each byte separately!

> Maybe I need to convert the text delivered by the CGI module to "real"
> Unicode first, or something like that.

I think that would be a possible workaround[1], but I think the
correct solution should be to use a corrected version of CGI.pm.
However, I haven't had time to try it, so YMMV.

[1] See e.g my reply in message-id
<[email protected]>
(that should give the spammers something to chomp on - it's amusing
how often I see mail logs refusing mail addressed to a usenet
message-id ;-)

Remember, if you decide to actually use Perl's native unicode support,
then you will need to apply the :utf8 layer to your output also (for
standard output, which is already open, you'd do that with binmode).
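
As a rough sketch of that workaround (decode on the way in, process characters, encode on the way out), the following uses the core Encode module. The input string is a hypothetical stand-in for what an older CGI.pm might hand back, and `:encoding(UTF-8)` is used here as a stricter relative of the `:utf8` layer mentioned above; treat it as an illustration under those assumptions rather than a fix for the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a form value delivered as raw UTF-8 bytes:
# O-umlaut (U+00D6) followed by "l".
my $raw = "\xC3\x96l";

# With no locale in effect, lc() on the undecoded bytes only touches
# ASCII letters, so the two bytes of the umlaut are left unchanged.
my $lc_bytes = lc $raw;

# After decoding, lc() sees one character U+00D6 and maps it to U+00F6.
my $text    = decode('UTF-8', $raw);
my $lc_text = lc $text;

# Apply an encoding layer so that characters leave the program as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$lc_text\n";
```

Without the binmode call, printing a decoded string that contains characters above U+00FF would trigger a "Wide character" warning.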

Now to some specifics...

> #!/usr/local/bin/perl
>
> use CGI;
> use strict;

(I don't see 'warnings'... myself I'd tend also to use -T as a matter
of course in any CGI situation.)

> my $query = new CGI;
> my $charset = 'UTF-8';
>
> # print header
> print $query->header(-type=>'text/html; charset='.$charset),

CGI.pm recognises -charset=>$charset here, you don't need to smuggle
it into the -type: see http://stein.cshl.org/WWW/software/CGI/#header

> $query->start_html(-title=>'Unicodetest');

This is rather unfortunate: looking at the generated result (with that
earlier version of CGI.pm, anyway), it seems it defaults to iso-8859-1
for its (X)HTML generation, and indeed this is what the documentation
says it will do:

The -encoding argument can be used to specify the character set for
XHTML. It defaults to iso-8859-1 if not specified.

Why it wouldn't default to the charset that you supplied on the
header() call, I don't know.


Does that get you any further? I hope so. Must get back to worm
fighting now, there's a bunch of bogus virus alerts hitting our users
from misguided virus scanners out there, as a byproduct of the current
worm crop (Gibe-F a.k.a Swen). It's always less work to catch those
bogus alerts in the mailer, than to calm the distressed users down
afterwards. The worms themselves are by comparison easy to block!
 

peter pilsl

Alan J. Flavell wrote:

> Does that get you any further? I hope so. Must get back to worm
> fighting now, there's a bunch of bogus virus alerts hitting our users
> from misguided virus scanners out there, as a byproduct of the current
> worm crop (Gibe-F a.k.a Swen). It's always less work to catch those
> bogus alerts in the mailer, than to calm the distressed users down
> afterwards. The worms themselves are by comparison easy to block!

:) Ack. Thanks for finding the time to give me a hand anyway. I got a lot of
useful information from you. Thanks a lot.

I have now reduced my problem even further, to pure Perl without any CGI or
SQL. (Let's say: I isolated one of my problems :)

I wrote a small script (source at the end) that produces some text and
tries to sort it. I was surprised that the sorting order depends on the
data being sorted :)


In my first attempt the order was as I expected. Then I added one more
text (a smiley) to the array and the sorting order was wrong again. This is
very interesting. (See HERE1 in the source.)

A second interesting thing is that the result of the ordering depends on
the locale used. It makes a noticeable difference whether I use "de_AT" or
"de_AT.UTF-8" as LC_CTYPE. In the first case, ordering and
lowercasing work fine. In the second case, ordering and lowercasing
produce garbage. (See HERE2 in the source.)

Other observations from my script: the "use locale" pragma is important, but
it does not matter which locale one uses (in fact even "korean" was OK) as
long as it does not contain "UTF-8" ;)

So all this Unicode stuff seems very chaotic to me. Minimal changes in the
input produce unpredictable changes in the output. :) In the end I am
supposed to set up a big database containing 100,000 entries that should be
fully searchable ;) And I am struggling with sorting in a 10-liner ;)

-----------------------------------------
#!/usr/bin/perl -w
use strict;
use locale;
use POSIX qw(locale_h);

print setlocale(LC_CTYPE),"\n";
setlocale(LC_CTYPE,"de_AT");
#setlocale(LC_CTYPE,"de_AT.UTF-8"); # <=== UNCOMMENT HERE2 !!!
print setlocale(LC_CTYPE),"\n";
#binmode(STDOUT,":utf8"); # I commented this out, because the output
                          # looks better on my terminal without it

my @s;
p("\x{00e4} this is lower german umlaut-a");
p("\x{00c4} this is upper german umlaut-A");
p("\x{00d6} this is upper german umlaut-O");   # U+00D6 is the capital letter
p("\x{00f6} this is lower german umlaut-o");   # U+00F6 is the small letter
p("a this is a lower a");
p("A this is a upper a");
p("z this is a lower z");
p("b this is a lower b");
p("B this is a upper B");
p("p this is a lower p");
p("P this is a upper P");
p("o this is a lower o");
p("O this is a upper O");
#p("\x{263a} this is a smiley"); # <===== UNCOMMENT HERE1 !!!

# lowercase each entry, append its length, then sort
@s=sort map {lc($_)." (l=".length($_).")"} @s;
print join("\n",@s),"\n";

# append one string to the list to be sorted
sub p{
    push(@s,shift);
}
----------------------------------



So I'm left confused. Another thing is that I don't know how to get the
Unicode code point back from a given string. Let's say
$a="\x{263a}" and I want to get the 263a back from $a. This could help me
to understand the data delivered by CGI. Using the locale pragma and
the above script on data delivered by my web script did not work. It seems
as if CGI.pm does not deliver real Unicode (although I use the recent
version 2.98).

I would be very pleased to hear from other people who have managed to run
bigger projects that rely completely on Unicode and include demanding tasks
like sorting and so on.

peter (who should have become a gardener or whatever ;)

PS: I can't thank Alan enough, and I wonder whether other people around here
are working with Unicode in Perl too. I googled a lot on my questions and it
seems that there is still much confusion about Unicode :)
 

Alan J. Flavell

> In my first attempt the order was as I expected. Then I added one more
> text (a smiley) to the array and the sorting order was wrong again.

This sounds like a consequence of Perl staying with 8-bit
representation as long as possible, and only upgrading to Unicode when
it has no alternative. See
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Perl's-Unicode-Model

So my hunch is that until you added the smiley, the sorting is working
just like it always did; but when you added the smiley, the whole
Unicode thing got switched on.

You'd get the same effect by adding a Greek or Cyrillic character -
anything that can't be expressed in the base 8-bit coding (iso-8859-1
in your case). Conversely, I guess if you taught Perl to work in
Greek (iso-8859-7) then it would switch into Unicode mode when you
added your German umlaut (but don't quote me on that).
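
The switch described above can be observed directly with utf8::is_utf8(), which reports whether Perl is currently storing a string in its internal UTF-8 form. This is only a sketch of the storage change, with sample strings chosen for illustration; it does not by itself explain every difference in the sort output.

```perl
use strict;
use warnings;

my $umlaut = "\x{e4}";     # a-umlaut fits into 8 bits
my $smiley = "\x{263a}";   # the smiley does not

# A string that fits into 8 bits stays in the 8-bit representation.
printf "umlaut upgraded? %s\n", utf8::is_utf8($umlaut) ? 'yes' : 'no';

# A character above U+00FF forces the internal UTF-8 representation.
printf "smiley upgraded? %s\n", utf8::is_utf8($smiley) ? 'yes' : 'no';

# Joining the two upgrades the combined string as a whole.
my $both = $umlaut . $smiley;
printf "joined upgraded? %s\n", utf8::is_utf8($both) ? 'yes' : 'no';
```

Printing the flag for each element of the list before sorting is a cheap way to check whether adding the smiley changed how the other entries are handled.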

> A second interesting thing is that the result of the ordering depends on
> the locale used. It makes a noticeable difference whether I use "de_AT" or
> "de_AT.UTF-8" as LC_CTYPE. In the first case, ordering and
> lowercasing work fine. In the second case, ordering and lowercasing

The perllocale page gives a list of pre-requisites before a locale can
work. Maybe you haven't got this locale available on your platform?
This area is something I really haven't explored yet, to be honest.
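
One way to check whether a locale is available is to look at the return value of POSIX setlocale(), which is undef when the requested locale cannot be set. The sketch below assumes nothing about which locales are installed; the candidate names are simply the ones from this thread plus "C". Note also that, according to perllocale, sort and cmp under "use locale" follow LC_COLLATE while lc and uc follow LC_CTYPE, so setting LC_CTYPE alone does not choose the collation order.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);

# Candidate locale names; which ones exist depends on the system.
my @candidates = ('de_AT', 'de_AT.UTF-8', 'C');

my %available;
for my $name (@candidates) {
    # setlocale() returns undef when the locale cannot be set.
    my $result = setlocale(LC_ALL, $name);
    $available{$name} = defined $result ? 1 : 0;
    printf "%-12s %s\n", $name,
        $available{$name} ? 'available' : 'not available';
}

# Put the process back into the portable "C" locale afterwards.
setlocale(LC_ALL, 'C');
```

If "de_AT.UTF-8" shows up as not available, the second setlocale() call in the test script silently leaves the previous locale in place.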

> So all this Unicode stuff seems very chaotic to me.

It's organised somehow under the covers! It only _seems_ to be
chaotic, I reckon.

It's a pity we don't have a regular contributor here who really
understands this stuff, so you're stuck with me who, more as a hobby
than a profession, learns it piece by piece as each new angle comes
up.

> So I'm left confused. Another thing is that I don't know how to get the
> Unicode code point back from a given string.

ord() works for unicode, it takes a Perl character (i.e not just a
byte!) as argument, and delivers an integer result. How you then
print that out (decimal, hex, whatever) is up to you. Unicode
themselves conventionally work in hex, as you know.

And if you feed that integer number to chr() then you get Perl's
unicode character out (unless, as I say, it decides it can get away
with just using 8-bit characters).

Those are just the customary Perl functions, transparently upgraded to
support unicode. And you can use substr() and so on, applied to
Perl's Unicode strings, it doesn't need any new syntax there.
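
For the smiley from the question, that round trip looks roughly like this. It is a minimal sketch; the variable names are arbitrary, and `$s` is used instead of `$a` because `$a` has a special meaning inside sort blocks.

```perl
use strict;
use warnings;

my $s = "\x{263a}";

# ord() returns the code point as an integer; format it as hex.
my $hex = sprintf '%04x', ord($s);
print "code point: $hex\n";

# chr() turns the integer back into the character.
my $back = chr(hex $hex);
print "round trip ok\n" if $back eq $s;

# For a longer string, list the code point of every character.
my @points = map { sprintf 'U+%04X', ord($_) } split //, "a\x{e4}\x{263a}";
print "@points\n";
```

Applied to a string that came from CGI, a listing like the last line shows at once whether an umlaut arrived as one character or as two separate bytes.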

hope that helps
 
