Does unpack() support higher-order Unicode strings for hex conversion?

fhscobey · Nov 3, 2005

Hi,
I've been playing around with the unpack() function to create literal
wide-hex strings in the form of '\x{}', to represent UTF-8 strings in
an application I support. I'm basically following the advice outlined
in the following perldoc:

http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod#Creating_Unicode

But, one thing I noticed is that for higher order values (>0xFF I
guess), unpack does not return the proper hex representation for the
Unicode code point for the character provided. Here is an example.

Example 1:
I have a flat file called 'utf8.string', which contains the following
Polish string "Pozostałe" (which means "Other" in English). If I run
the following, I get the output you see below:

$ cat utf8.string | perl -e 'binmode(STDIN,":utf8");
binmode(STDOUT,":utf8"); while(<STDIN>){$line=$_; chomp($line);
@raw_chars=split(//,$line); foreach $ch
(@raw_chars){$unpacked_char=unpack("H*",$ch);
push(@unpacked_chars,$unpacked_char);} foreach $ch
(@unpacked_chars){print("unpacked char = " . $ch,"\n");}}'
unpacked char = 50
unpacked char = 6f
unpacked char = 7a
unpacked char = 6f
unpacked char = 73
unpacked char = 74
unpacked char = 61
unpacked char = c582
unpacked char = 65

Notice that the hex values for all chars look OK, except for the second
to last. The 'ł' character is getting converted to 0xc582, which is
incorrect. I know from the Unicode chart at:
http://www.unicode.org/charts/PDF/U0100.pdf
that the correct code point is 0x142.
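
For reference, c582 matches the two-byte UTF-8 encoding of U+0142 (0xC5 0x82), not the code point itself. Below is a minimal sketch that reproduces both numbers, assuming the core Encode module is available. The variable names are only illustrative, and the explicit encode() step is an assumption added so the result does not depend on how unpack() treats a character string internally:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0142 LATIN SMALL LETTER L WITH STROKE, as a one-character string.
my $char = "\x{142}";

# The code point, which is what the Unicode chart lists.
my $code_point = sprintf("%X", ord($char));

# The UTF-8 encoding of that character, as explicit bytes.
my $utf8_bytes = encode("UTF-8", $char);

# "H*" hex-dumps the bytes it is given, high nybble first.
my $byte_hex = unpack("H*", $utf8_bytes);

print "code point  = $code_point\n";
print "UTF-8 bytes = $byte_hex\n";
```

This prints 142 for the code point and c582 for the byte dump, which are the two values seen above.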

Is this what is supposed to happen? I haven't seen anything in the
documentation that says the 'H' template for unpack cannot be used for
higher-order unicode characters. Did I miss something?

I can work around this by using the "U" template to unpack the chars, and
then putting the decimal values through:
sprintf("%X", $dec_value);
to get the correct code point hex value, but I was under the
impression that unpack() was supposed to be able to do that by itself.

Here is a sample of how I get the correct hex value:

$ cat test_utf8_string.3.utf8 | perl -e 'binmode(STDIN,":utf8");
binmode(STDOUT,":utf8"); while(<STDIN>){$line=$_; chomp($line);
@unpacked_chars=unpack("U*",$line); foreach $ch
(@unpacked_chars){print("unpacked char decimal = " . $ch,
" / converted to hex = " . sprintf("%X",$ch),"\n");}}'
unpacked char decimal = 80 / converted to hex = 50
unpacked char decimal = 111 / converted to hex = 6F
unpacked char decimal = 122 / converted to hex = 7A
unpacked char decimal = 111 / converted to hex = 6F
unpacked char decimal = 115 / converted to hex = 73
unpacked char decimal = 116 / converted to hex = 74
unpacked char decimal = 97 / converted to hex = 61
unpacked char decimal = 322 / converted to hex = 142
unpacked char decimal = 101 / converted to hex = 65
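
The same workaround can be written as a small script that builds the '\x{}' literal directly. This is only a sketch: to_wide_hex is a hypothetical helper name, and the input string is hard-coded here instead of being read from the file:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a character string into a run of \x{...} escapes.
sub to_wide_hex {
    my ($text) = @_;
    # unpack "U*" returns one code point (as a number) per character;
    # sprintf then formats each one as uppercase hex inside \x{}.
    return join '', map { sprintf('\\x{%X}', $_) } unpack("U*", $text);
}

# "Pozosta", then U+0142, then "e".
my $word    = "Pozosta\x{142}e";
my $literal = to_wide_hex($word);

print "$literal\n";
```

For this input it prints \x{50}\x{6F}\x{7A}\x{6F}\x{73}\x{74}\x{61}\x{142}\x{65}, with the second-to-last character coming out as 142, not c582.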

Just wondering if I'm using unpack() incorrectly, or if I'm wrong to
expect it to handle higher-order Unicode characters when converting to
hex format.

I'm on RedHat Linux 7.2, Perl 5.8.1.

Thanks for any assistance you can offer.
- Jeff
