Does unpack() support higher-order Unicode strings for hex conversion?

fhscobey

Hi,
I've been playing around with the unpack() function to create literal
wide-hex strings in the form of '\x{}', to represent UTF-8 strings in
an application I support. I'm basically following the advice outlined
in the following perldoc:

http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod#Creating_Unicode

But, one thing I noticed is that for higher order values (>0xFF I
guess), unpack does not return the proper hex representation for the
Unicode code point for the character provided. Here is an example.

Example 1:
I have a flat file called 'utf8.string', which contains the following
Polish string "Pozostałe" (which means "Other" in English). If I run
the following, I get the output you see below:

$ cat utf8.string | perl -e 'binmode(STDIN,":utf8");
binmode(STDOUT,":utf8"); while(<STDIN>){$line=$_; chomp($line);
@raw_chars=split(//,$line); foreach $ch
(@raw_chars){$unpacked_char=unpack("H*",$ch);
push(@unpacked_chars,$unpacked_char);} foreach $ch
(@unpacked_chars){print("unpacked char = " . $ch,"\n");}}'
unpacked char = 50
unpacked char = 6f
unpacked char = 7a
unpacked char = 6f
unpacked char = 73
unpacked char = 74
unpacked char = 61
unpacked char = c582
unpacked char = 65

Notice that the hex values for all chars look OK, except for the
second-to-last one. The 'ł' character is getting converted to 0xc582,
which is not its code point. I know from referencing the Unicode
documentation at:
http://www.unicode.org/charts/PDF/U0100.pdf
that the correct code point is 0x142.
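
One way to see where 0xc582 comes from is to encode the character to UTF-8 explicitly and dump the resulting bytes next to the code point. This is only a minimal sketch: it assumes the core Encode module is available, and U+0142 is hard-coded here for illustration rather than read from the file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0142 (LATIN SMALL LETTER L WITH STROKE), hard-coded for illustration.
my $char = chr(0x142);

# The character's code point, formatted as hex.
my $code_point = sprintf("%X", ord($char));

# The bytes of the character's UTF-8 encoding, formatted as hex.
my $utf8_bytes = unpack("H*", encode("UTF-8", $char));

print "code point  = $code_point\n";
print "utf-8 bytes = $utf8_bytes\n";
```

This prints 142 for the code point and c582 for the UTF-8 bytes, so the value in the output above matches the two-byte UTF-8 encoding of the character rather than its code point.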

Is this what is supposed to happen? I haven't seen anything in the
documentation that says the 'H' template for unpack cannot be used for
higher-order Unicode characters. Did I miss something?

I can work around this by using the "U" template to unpack the chars, and
then putting the decimal values through:
sprintf("%X", $dec_value);
to get the correct code point hex value, but I was under the
impression that unpack() was supposed to be able to do that by itself.

Here is a sample of how I get the correct hex value:

$ cat test_utf8_string.3.utf8 | perl -e 'binmode(STDIN,":utf8");
binmode(STDOUT,":utf8"); while(<STDIN>){$line=$_; chomp($line);
@unpacked_chars=unpack("U*",$line); foreach $ch (@unpacked_chars){
print("unpacked char decimal = " . $ch,
" / converted to hex = " . sprintf("%X",$ch),"\n");}}'
unpacked char decimal = 80 / converted to hex = 50
unpacked char decimal = 111 / converted to hex = 6F
unpacked char decimal = 122 / converted to hex = 7A
unpacked char decimal = 111 / converted to hex = 6F
unpacked char decimal = 115 / converted to hex = 73
unpacked char decimal = 116 / converted to hex = 74
unpacked char decimal = 97 / converted to hex = 61
unpacked char decimal = 322 / converted to hex = 142
unpacked char decimal = 101 / converted to hex = 65
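
For reference, the same workaround can be written as a small script that goes straight to the '\x{}' form mentioned at the top of this post. This is just a sketch under a couple of assumptions: to_wide_hex() is a made-up helper name, and the sample string is hard-coded instead of being read from STDIN.

```perl
use strict;
use warnings;

# to_wide_hex() is a hypothetical helper: it unpacks a character string
# into code points with the "U" template and formats each one as \x{HEX}.
sub to_wide_hex {
    my ($text) = @_;
    return join "", map { sprintf('\x{%X}', $_) } unpack("U*", $text);
}

# Sample string hard-coded for illustration; \x{142} is the 'l with stroke'.
my $sample = "Pozosta\x{142}e";

print to_wide_hex($sample), "\n";
```

For the sample string this prints \x{50}\x{6F}\x{7A}\x{6F}\x{73}\x{74}\x{61}\x{142}\x{65}, which agrees with the code points listed above.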


Just wondering if I'm using unpack() incorrectly, or if my
understanding that it should be able to handle higher-order Unicode
characters when converting to hex format is incorrect.

I'm on RedHat Linux 7.2, Perl 5.8.1.

Thanks for any assistance you can offer.
- Jeff
 
