XML::Parser, UTF-8 and Japanese characters

Hemant Shah

Folks,

I am having a problem writing Japanese characters.

I am parsing an XML document that is in UTF-8; actually it is a
content.xml file from Open Office. It contains Japanese text along
with English text (the English text and its Japanese translation).

I want to write the English and Japanese text into individual
files.

Another process will read these individual files and insert the text
into a DB2 database, which is also in UTF-8.

I am having a problem writing the Japanese text to a file.

I am running perl 5.8.3 on AIX 5.2.

Here are the code fragments from my script:


use Encode;
use encoding 'utf8', STDOUT => "utf8", STDIN => "utf8";
use XML::Parser;    # module names are case-sensitive


$ContentParser = new XML::Parser(Handlers => {Start   => \&HandleContentStart,
                                              End     => \&HandleContentEnd,
                                              Default => \&DefaultContentHandler,
                                              Char    => \&HandleContentChar});

$ContentParser->parsefile ("content.xml", ProtocolEncoding => 'UTF-8');



# In HandleContentChar() subroutine
open (TEMPFILE, ">:encoding(utf8)", $TmpFile) ||
die "Cannot open temporary file for write $TmpFile. $!";

# Code to print XML tags

print TEMPFILE "$JapaneseText";

# Code to print XML tags


close(TEMPFILE);


When I compare the Japanese text in the content.xml file with the text in
$TmpFile (hex dump), they are different.



Also, is there a way to split the Japanese text at a Unicode character
boundary? I would like to store lines of 100 (single-byte) characters or
less per line. I do not have any problem with English and Spanish text,
but Japanese characters are double-byte, so I would like to split the
line at 50 Japanese characters.


Thanks in advance.







--
Hemant Shah /"\ ASCII ribbon campaign
E-mail: (e-mail address removed) \ / ---------------------
X against HTML mail
TO REPLY, REMOVE NoJunkMail / \ and postings
FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind, Above opinions are mine only.
it's backed up on tape somewhere. Others can have their own.
 

Ben Morrow

Quoth (e-mail address removed):
> I am having a problem writing Japanese characters.
>
> I am parsing an XML document that is in UTF-8; actually it is a
> content.xml file from Open Office. It contains Japanese text along
> with English text (the English text and its Japanese translation).
>
> I want to write the English and Japanese text into individual
> files.
>
> Another process will read these individual files and insert the text
> into a DB2 database, which is also in UTF-8.
>
> I am having a problem writing the Japanese text to a file.
>
> I am running perl 5.8.3 on AIX 5.2.

That's a good start...

> Here are the code fragments from my script:
>
> use Encode;
> use encoding utf8, STDOUT => "utf8", STDIN => "utf8";

I would have explicitly binmoded the FHs, for clarity, but hey...
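
For what it's worth, explicit binmode calls might look something like the sketch below. It assumes whatever sits on the other end of STDIN and STDOUT really speaks UTF-8; PerlIO::get_layers is there only to show which layers ended up on the handle, and the sample text is made up.

```perl
use strict;
use warnings;

# Put a UTF-8 encoding layer on the standard handles explicitly,
# rather than relying on the 'encoding' pragma to do it.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

# List the layers now active on STDOUT (for inspection only).
my @layers = PerlIO::get_layers(\*STDOUT);
print "STDOUT layers: @layers\n";

# Sample text U+65E5 U+672C U+8A9E, written with escapes so the
# source file itself stays plain ASCII.
print "\x{65E5}\x{672C}\x{8A9E}\n";
```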

> use XML::Parser;
>
>
> $ContentParser = new XML::Parser(Handlers => {Start => \&HandleContentStart,
> End => \&HandleContentEnd,
> Default => \&DefaultContentHandler,
> Char => \&HandleContentChar});
>
> $ContentParser->parsefile ("content.xml", ProtocolEncoding => 'UTF-8');



> # In HandleContentChar() subroutine
> open (TEMPFILE, ">:encoding(utf8)", $TmpFile) ||
> die "Cannot open temporary file for write $TmpFile. $!";

Use lexical filehandles.
Use low-precedence operators to avoid brackets.

open my $TEMPFILE, '>:encoding(utf8)', $TmpFile or die ...;
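
Putting those two suggestions together, a minimal sketch might look like this. The file name and the sample text are invented for illustration; the raw read at the end is just one way to see which bytes actually landed on disk, so they can be compared against a hex dump of content.xml.

```perl
use strict;
use warnings;

# Hypothetical file name and sample text, for illustration only.
my $TmpFile      = 'japanese_sample.txt';
my $JapaneseText = "\x{65E5}\x{672C}\x{8A9E}";   # three characters

# Lexical handle, three-argument open, low-precedence 'or'.
open my $out, '>:encoding(UTF-8)', $TmpFile
    or die "Cannot open temporary file for write $TmpFile. $!";
print {$out} $JapaneseText;
close $out or die "Cannot close $TmpFile: $!";

# Read the raw bytes back, with no decoding layer.
open my $in, '<:raw', $TmpFile or die "Cannot open $TmpFile for read: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters, %d bytes: %s\n",
    length $JapaneseText, length $bytes,
    join ' ', map { sprintf '%02x', ord } split //, $bytes;

unlink $TmpFile;   # tidy up the sample file
```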

> # Code to print XML tags
>
> print TEMPFILE "$JapaneseText";

Don't quote variables unnecessarily; $JapaneseText on its own is enough.

> # Code to print XML tags
>
> close(TEMPFILE);
>
> When I look at the Japanese text in content.xml file and $TmpFile (hex dump),
> they are different.

How are they different? Are they equivalent representations of the text
(I don't know if there are any non-canonical representations for
Japanese)? Can you give some examples of input and output text?
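
On the equivalent-representations question: kana with voicing marks do have two canonically equivalent forms (for instance HIRAGANA LETTER GA is either U+304C or U+304B followed by the combining mark U+3099), so that is one thing worth ruling out. A rough way to check, assuming Unicode::Normalize is installed (it ships with Perl 5.8), is sketched here with made-up sample strings:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# HIRAGANA LETTER GA, precomposed and decomposed.
my $composed   = "\x{304C}";
my $decomposed = "\x{304B}\x{3099}";   # KA + COMBINING VOICED SOUND MARK

# The two forms have different codepoint counts and different bytes.
for my $s ($composed, $decomposed) {
    my $hex = unpack 'H*', encode('UTF-8', $s);
    printf "%d codepoint(s), bytes: %s\n", length $s, $hex;
}

# They differ as raw strings but compare equal after normalization.
print "raw equal: ", ($composed eq $decomposed ? "yes" : "no"), "\n";
print "NFC equal: ", (NFC($composed) eq NFC($decomposed) ? "yes" : "no"), "\n";
```

If the input and output hex dumps differ only in this way, the text is the same and only the normalization form has changed.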

> Also, is there a way to split the Japanese text at a Unicode character
> boundary? I would like to store lines of 100 (single-byte) characters or
> less per line. I do not have any problem with English and Spanish text,
> but Japanese characters are double-byte,

No they aren't. Most Japanese characters require 3 bytes in the UTF-8
encoding, and all accented Spanish characters will require at least two.
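
A quick way to see the sizes for yourself is to encode single characters and count the bytes. This is only a sketch; utf8_bytes is a name invented for the example and the sample characters are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# utf8_bytes is a hypothetical helper: number of bytes one string
# occupies once encoded as UTF-8.
sub utf8_bytes {
    my ($text) = @_;
    return length encode('UTF-8', $text);
}

# Arbitrary samples: ASCII, accented Latin, kana, kanji.
my @samples = (
    [ 'a',        'LATIN SMALL LETTER A' ],
    [ "\x{F1}",   'LATIN SMALL LETTER N WITH TILDE' ],
    [ "\x{3042}", 'HIRAGANA LETTER A' ],
    [ "\x{65E5}", 'CJK ideograph U+65E5' ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    printf "U+%04X %-32s %d byte(s)\n", ord $char, $name, utf8_bytes($char);
}
```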

> so I would like to split the line at 50 japanese characters.

What do you actually mean here? You claim not to mean 100 bytes/line,
but I suspect that might be what you actually want (if this is for some
program with a line-length limitation). Otherwise, do you mean 100
Unicode codepoints (100 complete UTF-8 sequences), 100 graphemes
(sequences like {LATIN SMALL LETTER A}{COMBINING ACUTE ACCENT}
which, while two Unicode codepoints, display as one character) or 100
(displayed) columns? These can be done by:

$string =~ s/(.{100})/$1\n/g; # CHARS (CODEPOINTS)

$string =~ s/(\X{100})/$1\n/g; # GRAPHEMES (COMBINING SEQUENCES)

; 'bytes' and 'columns' are slightly harder, and I can't see an easy way
to do them with a regex:

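# BYTES

use Encode;    # for encode_utf8

{
    my $newstring = '';
    my $width = 0;

    for (split //, $string) {
        # Size of this character once encoded as UTF-8.
        my $len = length Encode::encode_utf8($_);
        # Break before a character that would overflow the line; the
        # new line then starts with just this character's bytes.
        $width + $len > 100 and $newstring .= "\n", $width = 0;
        $newstring .= $_;
        $width += $len;
    }

    $string = $newstring;
}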
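
If byte-limited lines do turn out to be what is wanted, the same idea can be packaged as a sub with the limit as a parameter. This is a sketch only: wrap_bytes is an invented name, and it assumes the output will be written as UTF-8. It breaks before any character that would push the line past the limit, so a multi-byte character is never split.

```perl
use strict;
use warnings;
use Encode qw(encode);

# wrap_bytes is a hypothetical helper: insert "\n" so that no line
# exceeds $limit bytes once encoded as UTF-8.
sub wrap_bytes {
    my ($string, $limit) = @_;
    my $out   = '';
    my $width = 0;

    for my $char (split //, $string) {
        if ($char eq "\n") {            # existing newline: restart the count
            $out  .= $char;
            $width = 0;
            next;
        }
        my $len = length encode('UTF-8', $char);
        # Break first if this character would overflow a non-empty line.
        if ($width > 0 && $width + $len > $limit) {
            $out  .= "\n";
            $width = 0;
        }
        $out   .= $char;
        $width += $len;
    }
    return $out;
}

# Five 3-byte characters with a 7-byte limit: two fit on each line.
my $wrapped = wrap_bytes("\x{3042}" x 5, 7);
print encode('UTF-8', $wrapped), "\n";
```

For the case in the question the call would be something like wrap_bytes($JapaneseText, 100).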

# COLUMNS (taking CJK full-width forms into account)

use Unicode::EastAsianWidth; # install from CPAN

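{
    my $newstring = '';
    my $width = 0;

    for (split //, $string) {
        # Columns taken by this character: 2 for full-width, 1 for other
        # printing characters, 0 for non-printing ones.
        my $cols = /\p{IsPrint}/ ? (/\p{InFullwidth}/ ? 2 : 1) : 0;
        # There is a bug here: it doesn't deal correctly with
        # printing-but-not-spacing characters (like combining accents).

        # Break before a character that would overflow the line.
        $width + $cols > 100 and $newstring .= "\n", $width = 0;
        $newstring .= $_;
        $width += $cols;
    }

    $string = $newstring;
}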

<none of the above tested>. You will need to read the docs for
Unicode::EastAsianWidth if you use it: I don't fully understand what it
says about 'ambiguous width' characters, knowing very little about CJK
writing.
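
To make the codepoint/grapheme distinction concrete, here is a small sketch using the two regexes from earlier in this post, with much smaller counts so the effect shows up on a short made-up string. Splitting by codepoints can leave a combining accent stranded at the start of a line; splitting by graphemes cannot.

```perl
use strict;
use warnings;

# "a" + COMBINING ACUTE ACCENT, four times: 8 codepoints, 4 graphemes.
my $string = "a\x{301}" x 4;

(my $by_codepoint = $string) =~ s/(.{3})/$1\n/g;    # break every 3 codepoints
(my $by_grapheme  = $string) =~ s/(\X{2})/$1\n/g;   # break every 2 graphemes

my @cp_lines = split /\n/, $by_codepoint;
my @gr_lines = split /\n/, $by_grapheme;

# Count lines that begin with an orphaned combining mark.
my $cp_orphans = grep { /^\p{M}/ } @cp_lines;
my $gr_orphans = grep { /^\p{M}/ } @gr_lines;

printf "codepoint split: %d lines, %d starting with a combining mark\n",
    scalar @cp_lines, $cp_orphans;
printf "grapheme split:  %d lines, %d starting with a combining mark\n",
    scalar @gr_lines, $gr_orphans;
```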

Ben
 
