UTF-8 and Spreadsheet::ParseExcel

roberto0

Hello,

I'm trying to parse a large number of multilingual Excel sheets such
that I can load much of the data into an Oracle database. The problem
is that there are a number of UTF-8 characters that are not recognized
as "chars" by the DB and we need those fields to be searchable. The DB
requirement is for my script to generate ASCII characters and/or
transliterations from those UTF-8 characters. In other words, the DB
people want "alpha" to replace the UTF-8 {GREEK SMALL LETTER ALPHA}.
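
To make the requirement concrete, here is a minimal sketch of that kind of transliteration. It is written in Python purely for illustration and is not the script mentioned in this post; the function name `to_ascii` and the fallback of wrapping unknown symbols in braces are assumptions.

```python
import unicodedata


def to_ascii(text):
    """Replace each non-ASCII character with an ASCII stand-in.

    Hypothetical helper: letters become the tail of their Unicode name
    ("GREEK SMALL LETTER ALPHA" -> "alpha"); anything else is written
    out as its full name in braces.
    """
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)  # plain ASCII passes through unchanged
            continue
        name = unicodedata.name(ch, None)
        if name is None:
            pieces.append("?")  # character has no Unicode name
        elif " LETTER " in name:
            pieces.append(name.split(" LETTER ", 1)[1].lower())
        else:
            pieces.append("{" + name + "}")
    return "".join(pieces)


print(to_ascii("\u03b1-helix"))
print(to_ascii("5 \u00b5m"))
```

A real loader would probably want an explicit lookup table for the symbols that matter, since name-derived output such as "e with acute" is rarely what a search field needs.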

This is all fine and good, and I have scripts that do this rather well
for Unicode or other UTF-8 files. The problem arises when I use
Spreadsheet::ParseExcel to read MS Excel files. It seems that the
parser only picks up the last half of the character (the last byte of
a two-byte character, I think). It then becomes impossible to
differentiate between certain UTF-8 characters, since many have the same
second half.

For example, the UTF-8 symbols for {MICRO SIGN} and {GREEK SMALL LETTER
EPSILON} are both gleaned from ParseExcel as <B5>. When I parse the same
symbols from a plain Unicode text file, the two characters are reported as
<A3><B5> and <21><B5> respectively.
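
One possible reading of the symptom, offered as an assumption rather than a confirmed diagnosis, is that the cell text is stored as two-byte (UTF-16) units and only the low byte of each unit survives. The Python sketch below illustrates just that collision; `low_byte` is a hypothetical helper and nothing here calls ParseExcel itself.

```python
MICRO_SIGN = "\u00b5"  # U+00B5 MICRO SIGN
EPSILON = "\u03b5"     # U+03B5 GREEK SMALL LETTER EPSILON


def low_byte(ch):
    """Return the low-order byte of a character's UTF-16 big-endian form."""
    return ch.encode("utf-16-be")[-1]


for ch in (MICRO_SIGN, EPSILON):
    full = ch.encode("utf-16-be").hex().upper()
    print(f"U+{ord(ch):04X}  UTF-16BE={full}  low byte={low_byte(ch):02X}")
```

Both characters end in the same low byte, so once the high byte is dropped there is nothing left to tell them apart.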

I know ParseExcel uses OLE::Storage as its interface. Could the
problem lie there?
 
roberto0

Actually, the MICRO SIGN is just <B5> and GREEK SMALL LETTER
EPSILON is <CE><B5>.
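
For comparison, a short Python check of how these two characters encode. A lone <B5> for the MICRO SIGN is what Latin-1 (ISO-8859-1) produces; in UTF-8 the same character takes two bytes. The helper name `hex_bytes` is an assumption made for this sketch.

```python
def hex_bytes(ch, encoding):
    """Return the encoded bytes of ch as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))


MICRO_SIGN = "\u00b5"  # U+00B5
EPSILON = "\u03b5"     # U+03B5

print("MICRO SIGN  latin-1:", hex_bytes(MICRO_SIGN, "latin-1"))
print("MICRO SIGN  utf-8:  ", hex_bytes(MICRO_SIGN, "utf-8"))
print("EPSILON     utf-8:  ", hex_bytes(EPSILON, "utf-8"))
# EPSILON has no Latin-1 encoding; encoding it as latin-1 raises UnicodeEncodeError.
```

So a file that shows the MICRO SIGN as a single <B5> byte is likely Latin-1 rather than UTF-8 for that character, which is worth checking before comparing byte dumps from the two sources.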

Someone suggested that the context of the files I'm parsing may be the
key to determining the answer to my problem. However, the files I'm
parsing aren't perfect, and the less I rely on the context, the better.


Thanks in advance for any tips or advice,

roberto0
 
