roberto0
Hello,
I'm trying to parse a large number of multilingual Excel sheets so
that I can load much of the data into an Oracle database. The problem
is that there are a number of UTF-8 characters that are not recognized
as "chars" by the DB and we need those fields to be searchable. The DB
requirement is for my script to generate ASCII characters and/or
transliterations from those UTF-8 characters. In other words, the DB
people want "alpha" to replace the UTF-8 {GREEK SMALL LETTER ALPHA}.
This is all fine and good and I have scripts that do this rather well
for Unicode or other UTF-8 files. The problem arises when I use
Spreadsheet::ParseExcel to read MS Excel files. It seems that the
parser only picks up the last half of the character (the last byte of
the two-byte character, I think). It then becomes impossible to
differentiate between certain UTF-8 characters, since many have the same
second half.
For example, the UTF-8 characters {MICRO SIGN} and {GREEK SMALL LETTER
EPSILON} both come out of ParseExcel as <B5>. When I parse the same
characters from a plain Unicode text file, they are reported as
<A3><B5> and <21><B5> respectively.
I know ParseExcel uses OLE::Storage as its interface. Could the
problem lie there?
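
The collision described above can be illustrated with a minimal sketch. It is
written in Python rather than Perl purely to show the byte arithmetic, and it
assumes the standard Unicode code points U+00B5 (MICRO SIGN) and U+03B5
(GREEK SMALL LETTER EPSILON). The helper names describe and transliterate are
hypothetical, and the "last word of the Unicode name" rule is only one
possible reading of the DB requirement.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, low byte, UTF-8 bytes) for one character."""
    cp = ord(ch)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        f"{cp & 0xFF:02X}",               # what survives if only the low byte is kept
        ch.encode("utf-8").hex().upper(),  # the full UTF-8 encoding, as hex
    )


def transliterate(ch):
    """Hypothetical rule: Greek small letters become the lowercased letter name."""
    name = unicodedata.name(ch)
    prefix = "GREEK SMALL LETTER "
    if name.startswith(prefix):
        return name[len(prefix):].lower()
    return ch  # anything else is passed through unchanged


# MICRO SIGN, GREEK SMALL LETTER EPSILON, GREEK SMALL LETTER ALPHA
for ch in ("\u00b5", "\u03b5", "\u03b1"):
    print(describe(ch), transliterate(ch))
```

Both U+00B5 and U+03B5 have B5 as their low byte, and their UTF-8 encodings
(C2 B5 and CE B5) also end in B5, so once the leading byte is dropped the two
characters can no longer be told apart, whichever encoding is involved.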